STUDIES OF ACHIEVEMENT TESTS¹
BEN D. WOOD 
Columbia College, Columbia University 


I. The Right versus R-W Method of Scoring True-false Tests
II. The “Internal Constitution” versus the “External Form” of Examination
Questions
III. Spearman-Brown Reliability Predictions 


I. THE R VERSUS THE R-W METHOD OF SCORING “DO NOT GUESS”
TRUE-FALSE EXAMINATIONS


All the correlations² reported in this study are Pearson Product-moment
coefficients to which no corrections have been applied save
the usual correction for guessed averages. N is 100 for all coefficients
except the validity coefficients of the Pleading Test where N is 74. 

The experimental data presented here are derived entirely from 
true-false tests in which the students were directed not to guess. Our 
data apply solely to the question of the relative merits of two ways of 
scoring true-false examinations in which the students were strictly 
warned against the dangers of guessing. The question of the relative 
merits of “guess” and “not guess” directions does not arise in this
study except by inference. It is just as well to notice further that the 
data strictly apply only to the types of true-false questions contained 
in particular tests, made up for specific purposes and administered to 
particular groups of students. This warning seems desirable because 
we do not wish to over-generalize about groups of questions in the 
same subject-matter which may be of the same external form, and yet 





1 These studies were made possible by three grants from The Commonwealth 
Fund. 
2 Acknowledgments are made to Miss Helen Green and to Mrs. E. S. Flood, 
Assistants in the Bureau of Collegiate Educational Research in Columbia College,
for careful work in calculating and checking all the correlations, for making the
charts, and for other indispensable assistance.

be very different kinds of questions as regards what they actually
measure and how they behave statistically. 

Our data are derived from examinations in French, Law (Pleading 
and Practice, Torts, Equity, and Property courses) and Medicine 
(Anatomy), which were constructed and administered as follows: 

1. The French test was designed to measure comprehension of 
French prose and included 100 true-false statements asserting
commonly known truths or fallacies, such as “Dogs are animals, White
is black, The moon never shines in the U. S. A., More rainfall occurs in
deserts than in forested areas, All persons who ride on trains are over
fifty years of age, Some persons do not appreciate the advantages of
education as well as others do,” etc. The statements ranged from very
short and easy to fairly long and complex sentences, but the assertions 
in them were all within the comprehension of students intelligent 
enough to study French. These statements were made up mainly of 
words within the most frequently used 1000, as determined by vocabu- 
lary counts of 16 widely used textbooks; very few words were taken 
beyond the second thousand. The items were arranged in a random 
order. The directions and sample questions given to the students 
were as follows: 


Directions to Students.—The following statements in French are either true or 
false. Indicate by a plus sign (+) a true statement, by a zero (0) a false state- 
ment. Put the plus signs and zeros on the short lines at the left. Do not spend 
too much time on any one statement. Do the easy ones first. Do not guess; 
the chances are against you in guessing; wrong answers count against you more 
than omissions. 


+ ....1. Les gants couvrent les mains. 
0 ....2. Les chiens ordinairement n’ont pas d’oreilles. 


This test was given in September 1923 as part of a two-hour place- 
ment examination to 350 students who had just been admitted to 
Columbia College. Thirty minutes were allowed, but no student spent 
more than 28 minutes; one finished in 15 minutes, and a dozen in less 
than 20 minutes. The general purpose of the placement examination 
was explained, so that it is practically certain that every student put 
forth his best effort. The students realized that they might either 
lose or gain admission or college credit. The group must be considered 
as highly selected, even in comparison with other college freshman 
groups.¹ They averaged about 84 on the Thorndike Examination,





1 Wood: “Measurement in Higher Education.” World Book Co., Yonkers,
N. Y., 1923. See Table V.

with Standard Deviation about 12. The 100 papers used in the present
study were selected from among these 350 students’ papers by a purely 
random method. 

2. The four law examinations were given in June 1924 as parts of 
the final (and only) examinations in three Freshman courses and in one 
Sophomore course in the Columbia University School of Law. The 
numbers of true-false questions in them were as follows: Pleading 180, 
Property 200, Equity 140, Torts 130. Equity is a second year course. 
Each examination was made up with some technical assistance from 
the writer, by the Professor of Law who taught the course for which 
the examination served as a close. The questions were designed to 
measure not merely legal information and rules of law, but also
“reasoning ability” and “power to apply knowledge of the law
successfully to complicated and knotty legal problems.” That these
new-type law examinations did measure this “reasoning ability” better
than the old-type law examinations is amply shown in the analyses of
the results.¹ Although the numbers of questions range from 130 to
200, the time allowance was practically the same for all four examina- 
tions, between 90 and 120 minutes. Most of the students voluntarily 
handed in their papers as finished at the end of 90 minutes; nearly all 
were handed in at the end of 100 minutes; only a half-dozen stayed the 
full time allowed, 120 minutes. The three Freshman examinations 
were given to the same 225 men constituting the Freshman class; 
the Equity examination to the 150 men who took this course in 1923- 
24. The 100 papers from each of these four sets of examinations used 
in the present study were selected by a purely random method. Since 
the types of questions in these four examinations are similar, illustra- 
tions will be taken only from the Property examination, in which the 
questions are more problematic than in any one of the others. The 
directions to the students were identical in all four. The numbers 
preceding the following questions are the numbers which they had in 
the examination. 


COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK


School of Law 
Final Examination 1924 
Property I 
Directions: Read these statements and mark each one at the left of its number
with a plus sign if you think it is true; with a zero if you think it is false, wholly


1 Wood: Measurement of Law School Work, I and II. Columbia Law Review,
March, 1924, and March, 1925.

or in part. Each statement correctly marked gives you a credit of one point;
each incorrectly marked statement counts as a penalty against you, and is sub- 
tracted from your score; omitted statements do not count against you as pen- 
alties but reduce your total score. 
Your score will be based upon plus signs and zeros; don’t waste time writing 
anything else. 
First go through the list and mark all that you know for certain at once; then 
go back and study out the harder ones. 
Do not guess. The chances are against you in guessing. A wrong response 
counts more heavily against your score than an omitted question. 
Read each statement carefully. The truth or fallacy of a statement often 
depends upon the presence or absence of a single word. Ask no questions. If in
doubt use your own judgment and go ahead. 
....1. A tenant in capite was necessarily a holder of one or more large parcels of
land.

....2. The feudal incident of relief attached to socage tenure.

....3. The Statute of Quia Emptores increased the value of holding land.

....4. In several of the United States a limitation by A to B and the heirs of
his body has most of the legal consequences which such a limitation
had in England prior to De Donis.

....5. In some jurisdictions the rights of the husband in the realty of his wife
are barrable by will of the wife but not by her deed.

A by feoffment with livery conveyed Plot X to B and his heirs to the
use of C and his heirs before the Statute of Uses. (Facts for questions
26-30.)

....26. If C died without heirs, A was entitled to a resulting use.

....27. If B died without heirs, C lost all his rights in equity.

....28. If C died leaving a wife, M, M was entitled to a dower interest in the
equitable fee simple which had been C’s.

....29. B had the legal ability to destroy all of C’s rights to Plot X.

....30. C at any time in the 16th century and before the Statute of Uses, could
by his own act convey the legal title to Plot X.

....164. The English legislation of 1660 increased the value of holding land.

....165. The likelihood of escheat is increased by statutory enlargement of the
class “heirs.”

A to B and the heirs of his body, remainder to C and his heirs. At the
present time in New York. (Facts for questions 166-168.)

....166. If B died survived by an heir, C can never take.

....167. B has the power to convey a fee simple absolute at any time.

....168. If B dies without issue him surviving, but leaving a wife D, C can
immediately convey a fee simple absolute.

....190. Where the Statute Quia Emptores is in effect no tenure can exist between
the grantor and grantee of an estate of inheritance.

....200. In determining the priorities of two recorded deeds in most jurisdictions,
it is significant whether the deeds are warranty deeds or quit claim deeds.


3. The Examination in Anatomy was constructed by Dr. Mather 
Cleveland, Associate in Anatomy in the College of Physicians and
Surgeons, Columbia University, and administered in June 1925 to
the Freshman Class, numbering 100 students, as part of their final 
examination in anatomy. It included only 130 items. Registration 
is strictly limited in the College of Physicians and Surgeons and in 
this anatomy class we have a group that is as highly selected as the 
law classes described above, perhaps more highly selected. The 
directions to students and five random questions are reproduced for 
illustrative purposes. It will be noticed that in this examination the 
instruction “not to guess” is not emphasized. This circumstance may 
account for the indecisive results reported below for this examination. 

Directions: Put a plus sign (+) on the dotted line at the left of each state- 
ment that is true, and a zero (0) at the left of each that is false. Do not guess. 
....7. The deep branch of the radial nerve on the back of the forearm lies
superficial to the abductor pollicis longus.

....28. The lateral antebrachial cutaneous nerve is a branch of the radial nerve
and supplies the lateral aspect of the forearm.

....62. On cross-section of the middle one-third of the leg the tibialis posterior
lies behind the flexor hallucis longus.

....78. The lower extremity of the ligamentum pulmonis is reflected on the
diaphragm.

....87. Both lungs are entirely covered by pulmonary pleura except at the hilum
and the space between the two layers of the ligamentum pulmonis.


COMPARATIVE RELIABILITIES OF R AND R-W SCORES


One hundred papers from each of the sets described in (1) and (2) 
above were scored as follows: The questions were divided into random 
blocks of ten, and each block was scored by the Number Right method 
and then by the Right-minus-Wrong method. Thus, in the 100-item 
French test, there were ten blocks of ten true-false statements each, 
and each of these ten blocks had a Number Right score and a R-W score. 
In the 200-item Property examination we had 20 blocks of ten true- 
false questions each, each of which 20 blocks had a R and a R-W score. 

The scores of these blocks of tens were then combined so as to 
produce scores on a series of blocks of twenty questions each. By a 
similar process R and R-W scores were produced for blocks of 30, 
of 40, and so on up to 100 questions each in the case of the Property 
examination. Thus we are enabled to calculate the empirical reli- 
abilities not only of random halves of these examinations, but also of 
random tenths, of random two-tenths, of random three-tenths, etc., 
which enables us to compare the reliabilities of R and R-W scores all 
along the line, and to study the relative rates of growth in reliability 
of R and R-W scores as the number of items increases. 
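
The two scoring rules and the pooling of ten-item blocks described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the original tabulation: the function names, the answer key, and the sample responses are hypothetical, and an omitted item is represented here by None.

```python
def score_block(responses, key):
    """Return (R, R-W) for one block of items; None marks an omission."""
    right = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrong = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return right, right - wrong


def combine_blocks(block_scores):
    """Add the (R, R-W) pairs of several blocks to score one larger block."""
    total_r = sum(r for r, _ in block_scores)
    total_rw = sum(rw for _, rw in block_scores)
    return total_r, total_rw


# Hypothetical key and answers for one ten-item block (True = "+", False = "0").
key = [True, False, True, True, False, False, True, False, True, True]
answers = [True, False, None, True, True, False, None, False, True, False]

first_ten = score_block(answers, key)            # 6 right, 2 wrong, 2 omitted
print(first_ten)                                 # (6, 4)

# Pool with a second (hypothetical) block scoring R = 8, R-W = 7 to get a "20".
print(combine_blocks([first_ten, (8, 7)]))       # (14, 11)
```

Omissions leave both scores unchanged, so the two methods differ only in how wrong responses are treated.
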
We will consider the French test results first. Correlations were
calculated between R scores of 8 random pairs of ten-question blocks,
and between R-W scores of the same 8 pairs of tens. (For con- 
venience the blocks will be called 10’s, 20’s . . . 100’s, according to 
the number of questions in them.) Then r’s were calculated between 
the R scores of ten pairs of 20’s and between the R-W scores of the 
same ten pairs of 20’s. Similarly, for three pairs of 30’s, for four pairs 
of 40’s, and for four pairs of 50’s. The averages of these various 
classes of r’s are presented in Table I. 
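
As a hedged sketch of the computation just described, the Pearson coefficient can be calculated for each pair of blocks and the coefficients for one block size averaged. The function names and the short score lists below are hypothetical and serve only to show the arithmetic; the study itself used 100 papers per test.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_reliability(block_scores, pairs):
    """Average r over pairs of blocks; block_scores[b][s] is student s on block b."""
    return mean(pearson_r(block_scores[i], block_scores[j]) for i, j in pairs)


# Hypothetical R scores of five students on four ten-item blocks.
blocks = [
    [6, 8, 5, 9, 7],
    [5, 9, 4, 8, 7],
    [7, 7, 5, 9, 6],
    [6, 8, 6, 10, 6],
]
# Two non-overlapping pairs of blocks, so no item enters both correlated scores.
avg_r = average_reliability(blocks, [(0, 1), (2, 3)])
print(round(avg_r, 3))
```

The same routine applies unchanged to R-W scores, which allows the two methods to be compared on identical pairs of blocks.
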


TABLE I.—FRENCH TEST RELIABILITIES


Showing average reliability coefficients of various blocks of questions in a
100-item T-F test of French comprehension, when Score = Rights and when
Score = R-W. The K’s in the last line are derived from the average
r’s in the preceding line by the formula $K = \sqrt{1 - r^2}$


No. of items in blocks       10            20            30            40            50
No. of r’s averaged           8            10             3             4             4
Method of scoring         R     R-W     R     R-W     R     R-W     R     R-W     R     R-W
Average r               .3871  .344   .577   .512   .685   .583   .748   .702   .830   .802
K                       .930   .940   .817   .860   .728   .813   .664   .712   .558   .597


It is obvious that the four pairs of 50’s are merely different com- 
binations of the same 100 items, and that the four r’s are far from 
being independent determinations of $r_{11}$. (Of course, the same items
were never included in both scores correlated.) This criticism also 
applies with slightly less force to the r’s of the 40’s and of the 30’s. 
But the differences between the r’s are sufficient proof that they have 
some value for smoothing and stabilizing our figures. The reader 
will recall this reservation in connection with the r’s for the 50’s, 
60’s . . . 90’s and 100’s of the longer examinations. Since the pro- 
cedure with all examinations was identical, Tables II-IV are presented 
without further explanation. 
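
The conversion from r to K used in the last line of each table is the coefficient of alienation, $K = \sqrt{1 - r^2}$. A minimal sketch follows; the function name is a hypothetical choice, and the r values are three of those printed in Table I.

```python
import math


def alienation(r):
    """Coefficient of alienation K = sqrt(1 - r^2) for a correlation r."""
    return math.sqrt(1.0 - r * r)


# Average r values read from Table I (French test).
table_i_r = {"20 items, R": 0.577, "50 items, R": 0.830, "50 items, R-W": 0.802}

for label, r in table_i_r.items():
    print(label, round(alienation(r), 3))
# 20 items, R 0.817
# 50 items, R 0.558
# 50 items, R-W 0.597
```

Because K falls slowly while r is small and quickly as r nears 1, equal differences in r do not mean equal differences in K, which is why the charts are drawn in terms of K.
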


TABLE II.—PLEADING AND PRACTICE TEST RELIABILITIES (SEE TABLE I)


No. of items in blocks       10            20            30            40
No. of r’s averaged          16             7            15             6
Method of scoring         R     R-W     R     R-W     R     R-W     R     R-W
Average r               .338   .313   .502   .415   .611   .530   .695   .586
K                       .942   .950   .866   .910   .792   .849   .719   .811

No. of items in blocks       50            60            90
No. of r’s averaged           3             3             4
Method of scoring         R     R-W     R     R-W     R     R-W
Average r               .707   .640   .747   .693   .829   .771
K                       .707   .768   .664   .721   .559   .637





1 See next issue this journal for Table XII (article continued).

TABLE III.—PROPERTY TEST RELIABILITIES (SEE TABLE I)


No. of items in blocks       10            20            40            80           100
No. of r’s averaged           8             6             6         [illeg.]         4
Method of scoring         R     R-W     R     R-W     R     R-W     R     R-W     R     R-W
Average r               .246   .212   .432   .386   .566   .546   .683   .726   .752   .758
K                       .970   .978   .902   .923   .825   .838   .731   .688   .659   .652


TABLE IV.—EQUITY TEST RELIABILITIES (SEE TABLE I)


No. of items in blocks       10            20            30            40
No. of r’s averaged          12            21             6             3
Method of scoring         R     R-W     R     R-W     R     R-W     R     R-W
Average r               .183   .170   .365   .304   .399   .386   .441   .444
K                       .984   .986   .932   .954   .918   .924   .898   .897

No. of items in blocks       50            60            70
No. of r’s averaged           4             4             4
Method of scoring         R     R-W     R     R-W     R     R-W
Average r               .619   .529   .655   .602   .663   .650
K                       .786   .849   .756   .798   .749   .760


In order that comparisons may be made conveniently in all ranges
of r, we have turned the averages of the various classes of r’s into k’s,
by the formula k = coefficient of alienation = $\sqrt{1 - r^2}$. These
k’s are presented graphically in Charts 1-4.

CHART 1.—French test reliabilities showing in terms of K the comparative
reliabilities of R and of R-W scores, when the number of T-F items is 10, 20, 30, 40 and 50.
Smooth line is for Rights and dash line for R-W.

As far as reliability is concerned, these charts clearly favor Number
Right scores in two cases, slightly favor them in one case (Equity),
and produce a stalemate feeling in another case. In no case does the
Number Right score suffer by comparison with the R-W score, and
in only one case does R-W compare at all favorably with the Number
Right as to reliability.

CHART 2.—Pleading and practice test reliabilities. (See Chart 1.)

CHART 3.—Property test reliabilities. (See Chart 1.)

It should be noted that ten of the French true-false statements
were later thrown out of the test because they were thought to be
ambiguous by the French scholars who aided in the revision of the
test. Whether these ten items, which were undoubtedly ambiguous,
had a differential effect on the reliability of the two types of scores
is a matter of speculation. It seems clear that, as far as reliability
as here defined is concerned, the Number Right scores can absorb such
imperfections much more easily than the R-W scores can.

CHART 4.—Equity test reliabilities. (See Chart 1.)

CHART 5.—French test validities showing in terms of K the comparative validities
of R and R-W scores on various blocks of questions in a 100-item T-F test of French
comprehension.

The reader should notice that exact comparison between the
absolute magnitudes of the k’s is possible only between the Pleading 
and Property results. The French test results are from the responses 
of Freshmen just admitted to college; Pleading and Property from 
the first-year class in Columbia Law School; and Equity from the 
second-year law class, which normally consists of about 70 per cent of 
the freshman class, plus a handful of advanced standing students 
from other law schools. In terms of general intellectual ability, then, 
the French students have the greatest variability and the Equity 
students the least. 

Another reason for the small magnitude of the Equity r’s is the 
fact that 15 per cent of the statements were later found to be ambigu- 
ous, and an additional 13 per cent questionable for a variety of reasons. 
A possible reason for the difference between the magnitudes of the
Pleading and Property r’s is that the Pleading examination is much
more in the nature of an information test than the Property
examination. The same may be said of the French test; there is less room for
problem-solving in the French questions and in the Pleading questions
than in the Property questions. It is at least interesting to observe
that if we arrange these charts in the order of ascending problematic
qualities of the questions involved in each, French, Pleading and
Property (omitting Equity because of its known serious defects),
the magnitudes of the r’s decrease. It is possible, though not clearly
indicated, that the approach of the R and R-W curves in Chart 3
as opposed to their clear separation in 1 and 2 is due to the fact that
the Property Examination is made up almost exclusively of “problem”
questions, while the French and Pleading tests are very much nearer
to the information type of questions with which Toops and Ruch
experimented.¹


COMPARATIVE VALIDITIES OF R AND R-W TRUE-FALSE SCORES 


The criteria used to test the validities of R and R-W scores of 
these tests were as follows: 

1. The French criterion was the rest of the French placement test 
mentioned above, and included a 100-item vocabulary test (French 
word followed by 5 English words), 15 graded idioms in short phrases 





1 Toops: Trade Tests in Education. Teachers College, Columbia University,
1921: Contributions to Education No. 115.
Ruch: “Improvement of the Written Examination.” Scott Foresman and
Co., 1925.

to translate into English (credits 2, 1 and 0), and 30 graded French
sentences to translate into English, each translation being graded
objectively on a basis of ten credits for completely correct translations
and 1 off for each of a series of specified errors. The reliability of
this criterion, estimated by the Spearman-Brown formula, was about
0.93.
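
The study does not print the Spearman-Brown formula at this point. In its standard form, given here as background rather than taken from the text, a reliability r becomes nr / (1 + (n - 1)r) when the test is lengthened n times. A minimal sketch follows; the function name and the half-test coefficient are hypothetical, the latter chosen only to illustrate a stepped-up value near 0.93.

```python
def spearman_brown(r, n):
    """Predicted reliability of a test lengthened n times, given reliability r."""
    return n * r / (1.0 + (n - 1.0) * r)


# Hypothetical correlation between two halves of a criterion (not from the study).
half_r = 0.87
print(round(spearman_brown(half_r, 2), 2))  # 0.93
```

With n = 2 this is the familiar split-half step-up; fractional or larger n gives the predicted reliability for shorter or longer tests.
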

2. The Law criterion for the Pleading and Practice, Property and 
Torts examinations was the average of all the first year law marks. 
These were six in number: two were based on three-hour essay
examinations and each of the remaining four was based upon the average
of a two-hour essay examination and of a T-F examination of from
130 to 200 items. The R-W scores of each of the T-F law examinations
here studied are included in the criterion; but since these were in each
case first averaged with an old type examination and this average
further revised by class records, the possible vitiation seems negligible.
The inclusion of these four T-F examinations in the criterion is
certainly justified from the viewpoint of strengthening the criterion;
that it has introduced no serious spurious r in favor of R-W is indicated
by the differences in favor of R-W total scores when the criterion
included only the old type examination results.¹

3. For the anatomy examination we have used three criteria: (a)
the average of the final grades in all first year courses save anatomy; 
(b) the average of three old type one-hour examinations given in 
anatomy during the session; and (c) the score on a completion test of 
over 200 items given as part of the final examination in anatomy. 

Against these criteria we measured the validities of R and R-W 
scores of various blocks of questions in the various examinations by 
calculating the correlations described in Tables V-IX.


TABLE V.—FRENCH TEST VALIDITIES


Showing average validity coefficients of various blocks of questions in a 100-item
T-F test of French Comprehension when Score = Rights and when Score =
R-W. The K’s in the last line are derived from the average r’s in the
preceding line by the formula $K = \sqrt{1 - r^2}$


No. of items in blocks       20            50        Total (100)
No. of r’s averaged           4             4             1
Method of scoring         R     R-W     R     R-W     R     R-W
Average r               .587   .569   .678   .667   .706   .747
K                       .810   .822   .734   .746   .708   .667





1 See Table X and Chart 10.

TABLE VI.—PLEADING AND PRACTICE TEST VALIDITIES (SEE TABLE V)


No. of items in blocks         10                20                30                40
No. of r’s averaged             4                 4                 4                 4
Method of scoring          R       R-W       R       R-W       R       R-W       R       R-W
Average r               [illeg.] [illeg.] [illeg.] [illeg.]   .647     .658     .711     .706
K                         .831     .858     .714     .819     .763     .753     .703     .708

No. of items in blocks         60               100               140           Total (180)
No. of r’s averaged             4                 4                 4                 1
Method of scoring          R       R-W       R       R-W       R       R-W       R       R-W
Average r               [illeg.] [illeg.] [illeg.] [illeg.]   .815     .866     .845     .868
K                         .688     .658     .611     .578     .580     .500     .535     .497


TABLE VII.—PROPERTY TEST VALIDITIES (SEE TABLE V)


No. of items in blocks         10                20                40
No. of r’s averaged             4                 4                 4
Method of scoring          R       R-W       R       R-W       R       R-W
Average r                 .445     .474     .606     .594     .669     .709
K                       [illeg.] [illeg.]   .796     .804     .743     .705

No. of items in blocks         80               160           Total (200)
No. of r’s averaged             4                 4                 1
Method of scoring          R       R-W       R       R-W       R       R-W
Average r                 .741     .797     .806     .874     .819     .892
K                         .672     .604     .592     .486     .573     .452


TABLE VIII.—TORTS TEST VALIDITIES (SEE TABLE V)


No. of items in blocks         10                20                40
No. of r’s averaged             4                 4                 4
Method of scoring          R       R-W       R       R-W       R       R-W
Average r                 .552     .473     .593     .576     .649     .671
K                         .834     .882     .806     .818     .761     .742

No. of items in blocks         80               100           Total (130)
No. of r’s averaged             4                 4                 1
Method of scoring          R       R-W       R       R-W       R       R-W
Average r                 .670     .760     .736     .807     .761     .815
K                         .742     .649     .677     .591     .649     .581


TABLE IX.—ANATOMY TEST VALIDITY


Correlations of Total Scores, R and R-W, on a 130-item T-F test in Anatomy with
each of three criteria of achievement in Anatomy and in first year medical
college work


Criteria                                                       Rights    R-W

Average 3 1-hr. old type exams. in anatomy                r    0.654     0.632
                                                          k    0.756     0.775

Average all 1st yr. medical course grades, save           r    0.649     0.640
anatomy                                                   k    0.761     0.768

200-item completion test given as part of final           r    0.766     0.776
exam. in anatomy                                          k    0.643     0.631


The averages of the various classes of r’s were turned into k’s and
plotted in Charts 5-10.
These charts show that: 

a. In three cases out of 5 the R-W is clearly and consistently a more 
valid score than Number Right, when the number of statements is 
above 50. 

b. In one case (French) R-W is slightly better than R when the 
number of statements is 100, with negligible differences when N = 
20 and 50. 


c. In the fifth case (Anatomy) the differences are negligible. 


CHART 6.—Pleading and practice validities. (See Chart 5.)


When the number of true-false statements is above 50, R-W is a 
more valid score than Number Right in all the examinations save 
Anatomy, where there is approximate equality. If we arrange the 
examinations in an ascending order of problematic quality of the ques- 
tions (Anatomy, French, Pleading, Torts, Property) we thereby 
arrange them in ascending order of differences between R and R-W 
validity coefficients. As noted above the Anatomy and French tests 
are much nearer to the “information type” of true-false questions 
than the law examinations. In the former, we have only negligible
R − (R-W) differences, although it is to be noted that the number of
T-F statements is only 100 in French and 130 in Anatomy, and it is
possible that significant differences might appear, even in the simplest
T-F information test, if N were 200 or more. In the law examinations
we have a veritable opposition between Reliability and Validity of
R and R-W scores. One thing is certain: R-W scores do not suffer
in a single case by comparison with R scores as to validity, whatever
may be said of their comparative reliabilities. It may well be that in
simple information types of T-F tests the extra time and work of
scoring R-W is not justified by the small gain in validity, when “Do
not guess” directions have been given; but the writer’s feeling is that
R-W should be favored, not only because the evidence does favor it
as being more valid, but because of pedagogical desiderata. The
writer would favor “Do not guess” instructions for the same reason,
until positive experimental data to the contrary are produced.

CHART 7.—Property test validities. (See Chart 5.)

CHART 8.—Torts test validities. (See Chart 5.)

The greater validity of R-W scores on the Law tests is not appreci- 
ably due to the fact that the criterion was made up partly of R-W 
scores. Using the average of six old type final examinations as a
criterion, we find the R-W about as much superior to R for the total 
R-W and R scores as is indicated in Charts 6, 7 and 8. Table X 
and Chart 10 give the facts in detail. 


TABLE X.—LAW TEST VALIDITIES


Correlations of Total Scores, R and R-W, on each of three law examinations with
the average of six old type final examinations given at end of first year law course


                    Pleading           Property           Torts
                    180 T-F items      200 T-F items      130 T-F items
r, Rights           0.688              0.705              0.605
r, R-W              0.744              0.745              0.674
k, Rights           0.726              0.710              0.796
k, R-W              0.668              0.667              0.739

















The opposition between Reliability and Validity set forth by our 
data is a warning against overestimating the importance of small 
differences in reliability coefficients and particularly against a general 
exaggeration of the meaning of reliability coefficients as such. There 
is no doubt but that in general the highest possible reliability is to be 
preferred, other things being equal; but not at the expense of validity. 
We must not forget that a test with $r_{11} = 0.64$ may give $r_{1c} = \sqrt{.64}$,
and may thus be a better test than one with $r_{11} > 0.64$, which measures
something not so highly related with c. 
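
The bound behind this remark can be written out as a short worked relation. It is a sketch under the usual assumption that errors of measurement are uncorrelated. Here $r_{11}$ is the reliability of the test and $r_{1c}$ its correlation with the criterion c; $r_{cc}$, the reliability of the criterion, is a symbol introduced only for this illustration.

```latex
r_{1c} \;\le\; \sqrt{r_{11}\, r_{cc}} \;\le\; \sqrt{r_{11}},
\qquad\text{so that } r_{11} = 0.64 \text{ permits } r_{1c} \text{ as high as } \sqrt{0.64} = 0.80 .
```

A test of higher reliability has a higher ceiling, but nothing guarantees that its actual correlation with the criterion comes near that ceiling.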

These data also indicate that we have to a certain extent misinter- 
preted the meaning and underestimated the importance of the differ- 
ence between the R and R-W methods of scoring. We have more or 
less assumed that “penalties” are proper only so far as they correct
for chance influences, and are not justified insofar as they do more, or
less, than correct for chance. Consequently our efforts to establish or
dis-establish the “conventional corrections for chance” or “penalties”
have mainly taken the direction of showing that they go beyond,
or fall short of, negativing “guessing.” We have even gone so far as
to accept the judgments of examinees as to what items they “guessed”
at, and this in the face of the fact that the students seem to be nearly
as often wrong in marking items they “know” as in marking items they
“guess at.”¹

It seems clear, however, that in choosing between R and R-W 
scoring methods we are choosing between two fundamentally different 
tests—between two tests which measure different functions, not 
between two tests which measure the same thing with slightly different 
degrees of reliability.² The difference in what they measure may not
be as great as the difference between the reliabilities with which they 
measure whatever they measure; but it is more important. The cor- 
relations between R and R-W total scores are as follows: 


                                      r        k
Pleading (180 T-F items)            0.921    0.39
Property (200 T-F items)            0.876    0.482
Torts (130 T-F items)               0.872    0.484


Reference to the reliability coefficients of R and R-W scores of 
these tests will show that the small magnitude of $r_{R(R-W)}$ is not
wholly accounted for by low reliabilities and indicates that the differ- 
ences in what R and R-W scores measure are probably much greater 
than is indicated by the data in Charts 5-9. 

Ruch’s opinion, that the amount of “sheer guessing” on the part
of students has probably been grossly overestimated, seems to the 
present writer to be more consonant with common sense and the avail- 
able evidence than the opposite view. The better half of this opinion 
is that we may consider all but a small fraction of a student’s responses 
as genuine reactions, expressing his mind in a manner somewhat
more real than by an imaginative coin-flipping. Thus the wrong 
responses of a student, insofar as they are the result of judgment on 
his part as serious and as honest as that which produced his correct 
responses, must be considered as revelations of his mind no less than 


1 West: Critical Study of R-W Method. J. E. R., 1923, 1-9; also Ruch: op. 
cit., p. 117. 

2 This idea was clearly suggested by the late lamented Chapman: Journal of
Applied Psychology, 1922, pp. 342-348.

his correct responses are so considered. To ignore the mistakes of the
student merely because they are honest mistakes, due to actual mis-
information, may be just as bad from the measurement viewpoint as to
ignore his right responses and take into account only his wrong
responses. To treat wrong responses as omissions is undoubtedly
bad economy; it is equivalent to throwing away a part, perhaps a
very significant part, of a test. Indeed, it may well be that the main 
value of the “corrections for chance” is not that they correct chance 
aberrations but that they “penalize the student for honest mistakes,”

Chart 9. Anatomy test validity, showing in terms of K the validity of a
130-item T-F test in Anatomy, when S = R and when S = R-W, as measured
against the three criteria described in Table XI. Circles show k's when
S = Rights, and crosses when S = R-W.

Chart 10. Law test validities, showing in terms of K the comparative
validities of total R and total R-W scores on each of three T-F law tests,
using the criterion described in Table X. Circles show K's for S = Rights,
crosses for S = R-W.


and thus make the score depend upon all the responses of the student 
rather than upon an arbitrary fraction of them. 

Let us consider a hypothetical case, purposely made extreme. 
A, B and C are high school seniors who have just taken their final 
examination in American History. A knows every item in the 100-statement
T-F test which constituted the final examination and has
marked them all correctly; B knows not one item, and, obeying the 
instructions not to guess, has not marked a single statement; C knows 
all the items but knows them wrong, and so marks every item incorrectly.
That is, C knows that Washington was the tenth president,
knows that the American Revolution began in 1865, that Jefferson
Davis built the Union Pacific, that the isothermal lines of 0°C. and
20°C. approach each other more closely on the Pacific coast than on the
Atlantic, that the first transcontinental railroad traversed the Gadsden
Purchase, that the “sea level route” from New York to Chicago is
the Pennsylvania Railroad, that inland temperatures are less variable 
than seaside temperatures, knows that manhood suffrage was estab- 
lished immediately after Cornwallis surrendered his sword into the 
hands of General Lincoln at Yorktown, etc. 

It has been argued that the “chance correction” would over-correct
C’s score since he is not guessing but is merely in error, honestly
misinformed. Perhaps a rigorous application of the conventional 
correction here would give too much weight to misinformation as an 
index of achievement in American History; but few would defend a 
scoring scheme which would give A, B, and C scores of +100, 0 and 0, 
respectively. Off-hand common sense would favor scores +100, 
0 and —100 in this case, because to be actually misinformed about 
any considerable number of such items would seem, under normal 
circumstances, to indicate a dullness and a critical incapacity more 
pronounced than correct information would indicate of positive ability 
and alertness. But the matter is probably more complicated than 
appears on the surface; nothing short of exact and extensive investiga- 
tions, with the aid of the partial and multiple correlation technique, will 
enable us to assign to the correct and incorrect responses and omissions 
the weights which would be best for specified purposes. That we 
cannot afford to throw away all the questions marked incorrect is at 
least suggested by the validity which they possess in their own right. 
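
The hypothetical case of A, B and C can be worked through directly. The Python sketch below is only an illustration: the `Paper` class and the function names are invented for it, and the two rules shown are the plain Rights score and the R-W score for a two-choice test.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """One student's 100-item true-false paper (hypothetical container)."""
    rights: int
    wrongs: int
    omits: int


def score_r(p: Paper) -> int:
    """Score = Rights; wrong responses are treated exactly like omissions."""
    return p.rights


def score_r_minus_w(p: Paper) -> int:
    """Score = Rights minus Wrongs."""
    return p.rights - p.wrongs


papers = {
    "A": Paper(rights=100, wrongs=0, omits=0),   # knows every item
    "B": Paper(rights=0, wrongs=0, omits=100),   # knows nothing, marks nothing
    "C": Paper(rights=0, wrongs=100, omits=0),   # "knows" every item wrongly
}

for name, p in papers.items():
    print(name, "R =", score_r(p), " R-W =", score_r_minus_w(p))
```

Rights scoring cannot tell B from C; R-W separates all three students. Whether C's position should be exactly as far below zero as A's is above it is the empirical question of weights raised in the text.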


TABLE XI. VALIDITIES OF FOUR METHODS OF SCORING T-F TESTS

Showing the comparative validities of R, R-W, W and O scores on indicated T-F
tests. The criteria are as described for Tables V-IX. N = 100 for all these
r's except those in the second line where N = 74.

Tests                             Rights     R-W      Wrong    Number omitted
[illegible]                        .706      .747     -.636        -.475
Pleading 180 items (N = 74)        .688      .744     -.550        -.460
Property 200 items                 .705      .745     -.565        -.283
[illegible]                        .605      .674     -.600        -.180
[illegible]                        .649      .640     -.420        -.460

Only the partial correlations, however, would tell us the weights
which would give us the highest relationships with the criteria. The
differences between the magnitudes of the correlations for the omis-
sions are especially interesting. The three law examinations were of
the same type and were given to the same 225 students within the same
week with identical directions; yet the raw correlations between the
criterion and the “omission scores” range from -0.18 to -0.46.

It seems strange that so little attention has been given to the 
problem of weighting omissions. We have apparently assumed that 
the proper weight would automatically fall on omissions if we ignored 
them. The main argument about the way to treat wrong responses 
has involved the simple dilemma: Shall we treat them as we do omissions,
or make the score equal to $R - W/(n-1)$? This over-simplification
seems to be due in large part to the fact that most studies have been 
made from the viewpoint of reliability, of correcting for chance, and 
have somewhat neglected the validity approach. 
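
The conventional correction named in the dilemma above is S = R - W/(n - 1), where n is the number of alternatives per item; for true-false items (n = 2) it reduces to R - W. The Python sketch below shows that formula and, beside it, a more general weighted score in which wrongs and omissions each receive their own weight. The function names and the weights are hypothetical; the text's point is that such weights would have to be found empirically against a criterion.

```python
def corrected_score(rights: int, wrongs: int, n_choices: int = 2) -> float:
    """Conventional correction for chance: S = R - W / (n - 1)."""
    if n_choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return rights - wrongs / (n_choices - 1)


def weighted_score(rights: int, wrongs: int, omits: int,
                   w_wrong: float, w_omit: float) -> float:
    """Hypothetical general form: S = R + w_wrong * W + w_omit * O.

    Rights scoring is w_wrong = 0, w_omit = 0; R-W is w_wrong = -1, w_omit = 0.
    """
    return rights + w_wrong * wrongs + w_omit * omits


# A made-up paper: 60 right, 20 wrong, 20 omitted.
print(corrected_score(60, 20, n_choices=2))   # true-false items
print(corrected_score(60, 20, n_choices=5))   # five-choice items
print(weighted_score(60, 20, 20, w_wrong=-1.0, w_omit=0.0))
```

Seen this way, the usual dilemma compares only two points (w_wrong = 0 and w_wrong = -1/(n - 1)) and holds the omission weight at zero throughout.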

We might go even farther than Ruch in suggesting that we have 
probably overestimated the guessing factor, and say that we have 
often created this factor by directing the victims of our tests to guess 
at questions about which they know nothing, thus putting ourselves 
in the ambiguous position of seeking to escape a small chance factor 
by purposefully making it larger. Of course, the usual justification 
for this procedure is that certain retiring and shy, or intellectually 
cautious, examinees might not do anything on a test unless they are 
told to guess whenever they cannot answer by any other means. This 
is, at best, a very doubtful way of overcoming negativism in individu- 
als, and of protecting the intellectually cautious from suffering by 
comparison with the intellectually careless, not to say dishonest, who 
will “guess” in any case at all items they do not know. This theory
seems to be highly colored by the old punitive attitude towards exam- 
inations, which involved the assumption that most students would 
always prefer to get credits by stupid or crooked methods even if 
they were able, or thought they were able, to get them by an honest 
exercise of intellectual gifts and good judgment. 

The high relationships of wrong-answer scores with the criteria, as 
set forth in Table XI, show that chance alone did not produce the 
wrong responses; in the Torts examination the wrong responses have 





1 The writer hopes to present some partial and multiple correlation data in
the near future, using the results of T-F, Recognition, Recall and Free Answer
Problem Tests.

as much validity as the right responses. Bad judgment and actual
misinformation evidently played a significant part, and there is no 
more reason for throwing away the wrongs than for throwing away 
the rights. If the correlation between Rights and Wrongs is very 
nearly perfect, then either Rights or Wrongs may be thrown away, 
other things being equal. 

It is more or less customary to assume that chance and guessing 
do not enter into the answers in essay and recall forms of tests. We 
are quite willing to believe that a student is “honestly mistaken”
about all or most of his wrong responses in a recall test, and that he is 
equally honestly and knowingly right and correctly informed about 
all items which he has answered correctly. But we are more than 
willing to believe that he is knowingly guessing, not only on his wrong 
responses, but on an equal number of his right responses in a true- 
false test. Thus we justify the R-W scoring method; but our reason 
here is not as sound as our practice. Except where we make him 
otherwise by unwise directions, the student is probably very nearly 
as honest and conscientious in T-F as in Recall responses; it is also
quite likely that in both forms of test he guesses neither more nor 
less often, proportionally, on his wrong responses than on his right 
responses. Our rationalization here has escaped exposure because 
the R-W method is justified not merely as a means for somewhat 
attenuating the effects of gross guessing, but also, and perhaps mainly, 
as a means for taking advantage of a greater fraction of the inherent 
validity of our tests by differentiating students on the basis of relative 
degrees of misinformation and of bad judgment, as well as on the basis 
of relative degrees of correct information and of good judgment. 
Some students are undoubtedly as far below zero knowledge of, or 
ability with, a given question as others are above zero; why, then, 
count all who answer it wrongly, thinking they are right, as just 
equal to those who in their own judgment had zero ability and omitted 
it? If the students have been told not to guess, and the R-W scoring 
method has been explained to them, we have two good reasons for 
giving minus credits instead of zero credits for wrong responses; one 
meets the assumption that the wrong answers are due to guessing and 
the other that they are due to honest misinformation or bad judgment; 
if the student guessed wrongly once, he has likely guessed rightly 
once; if he is misinformed or has used bad judgment, he is likely 
just as far below zero as a genuine correct answer would place him 
above zero.

In this connection it is interesting to note that very little experi-
mentation along these lines has been bestowed on Recall forms of
tests. No one knows the relative validities of Rights, Wrongs or Omis-
sions on Recall tests. Apparently, it has never occurred to any one to
study the effects of minus credits on recall tests, even from the view-
point of “corrections for chance,” let alone from the viewpoint of
validity. It is evidence not only that we have too much ignored
validity as against reliability in our experimental studies of tests, but
that we have been too willing to make plausible assumptions as to
the applicability of “corrections for chance.” It has been my experi-
ence that students “guess” as grossly on recall questions as on other
types, if their statements in confidential conferences on examination
results can be trusted; and it is well known that students often give
as many wrong as right responses on recall tests.

Let us consider, for example, the case of the young student of 
philosophy who writes anent the Cartesian Revolution that “It was 
the uprising of the laboring classes in England to secure their rights 
from the greedy landlords; the Charter comes from Magna Carta.”
If this were known to be a guessed response, I should be content to 
assign it minus one credit; but if, as the student assured me, it was an 
answer given with full conviction of its truth and correctness, I should 
be inclined to assign a credit of minus two or minus three. 

Obviously, tests which involve situations of this sort are as worthy 
of experimental study, from the viewpoint of both validity and reliability,
as the T-F and Recognition forms. A priori arguments will
avail us little. Large scale experimentation with the partial correla- 
tion technique and careful attention to criteria is needed. 

The fundamental argument here is based on validity or significance 
as against reliability. We have shown that there is an opposition 
between reliability and validity of R and R-W scores in the law exam- 
inations above. It is certainly possible that spurious high reliabilities 
may be secured by juggling accessory conditions such as time-limits, 
method of recording answers and the like, without securing high or 
even maintaining moderate validities. 

Let us assume that we have administered a 200-item T-F test to
1000 children, not one of whom knows enough about the subject-matter 
to answer a single question correctly; that they were told to guess at 
all items which they did not know; that 100 children marked only about 
20 items, 100 children only about 40 items, . . . and only 100 chil-
dren marked approximately all the items. This last supposition is not
as violent as may seem at first sight. Dr. Ruch1 adverts to the inter-
esting fact that some “morons in the third and fourth grades go straight
down the test page, marking as ‘true’ or as ‘false’ items of eighth or high
school grade difficulties,” and marking them with apparently absolute
confidence and dignified seriousness. Dr. Ruch has also undoubtedly
observed that some of these morons will spend the entire period in
marking the first page, while others will reach the second or third or
fourth page by the end of the period. The equanimity and gravity
with which many of our school children go through such motions,
“repeating the words without knowing the tune,” might be commented
upon from several viewpoints of interest to the educational philosopher
and administrator; but the point to be noticed for purposes of this
discussion is that if we score this 200-item T-F test counting Number
Right as the score, the correlation between evens and odds will prob-
ably be surprisingly high, certainly higher than $r_{\text{odds-evens}}$ when
S = R-W. A priori one would guess that the reliability when Score
= Rights under the conditions assumed above, might be as high as
0.80 or 0.90 and when Score = R-W very near zero; but it is clear that
both $r_{cR}$ and $r_{c(R-W)}$ would be +0.00. Since children are as they are,
and “guessing-speed” seems to vary almost as much as “thinking-
speed,” one is tempted to make the unorthodox generalization that
when Score = Rights on a True-false test, $r_{\text{evens-odds}}$ will increase as the
number of students who guess increases.
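
The thought experiment above can be checked by simulation. The Python sketch below rests on stated assumptions rather than on the author's data: 1000 children in ten groups of 100, each group marking the first 20, 40, ..., 200 items of a 200-item test, with every marked item right or wrong by a fair coin flip. It then correlates odd-item and even-item half scores under each scoring rule. All names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(seed=0, n_items=200, group_size=100):
    """Return (odd-even r for Rights, odd-even r for R-W) under pure guessing."""
    rng = random.Random(seed)
    r_odd, r_even, rw_odd, rw_even = [], [], [], []
    # Ten groups: the children differ only in how many items they mark.
    for attempted in range(20, n_items + 1, 20):
        for _ in range(group_size):
            rights = [0, 0]   # index 0: odd-numbered items, 1: even-numbered
            wrongs = [0, 0]
            for item in range(attempted):
                half = item % 2
                if rng.random() < 0.5:      # a sheer guess
                    rights[half] += 1
                else:
                    wrongs[half] += 1
            r_odd.append(rights[0])
            r_even.append(rights[1])
            rw_odd.append(rights[0] - wrongs[0])
            rw_even.append(rights[1] - wrongs[1])
    return pearson(r_odd, r_even), pearson(rw_odd, rw_even)


rel_rights, rel_r_minus_w = simulate()
print(f"odd-even r, Score = Rights: {rel_rights:.2f}")
print(f"odd-even r, Score = R-W:    {rel_r_minus_w:.2f}")
```

Under these assumptions the Rights half scores correlate highly, because both halves reflect how many items a child marked, while the R-W half scores correlate near zero. No child knows anything, so neither score could have any validity; the high Rights reliability here is produced entirely by differences in "guessing-speed."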

The quest for reliability as such needs no apology in the eyes of 
the scientific world; but in the matter of practical improvement of 
measuring instruments for the use of teachers, let us not forget that 
our real objective is validity per unit of time; by pursuing this objective 
we shall render full homage to both validity and reliability coefficients. 


1 Op. cit.

(Part II will appear in the February issue)

THE STATUS OF UNIVERSITY INTELLIGENCE
TESTS IN 1923-24


HERBERT A. TOOPS 
Ohio State University 


In February, 1924, a questionnaire containing twenty-five queries 
on the administration and uses of college intelligence tests was mailed 
to 110 colleges and universities throughout the country. No effort 
was made to make these representative of the college situation in the 
country. One may judge for himself from the names of the colleges 
below the degree of representativeness of the colleges selected. The 
aim was not so much to determine accurately what per cent of
American colleges are using such tests, but rather to determine certain 
working principles which have been developed by universities through- 
out the country in regard to the use of tests. As the result of a series 
of follow-up letters, the returns were finally made complete, with 100 
per cent of replies. The results are undoubtedly representative of 
the practices employed in such of these colleges as use intelligence 
tests. To 50 colleges in the state of Ohio, however, only questions 
1 to 5 inclusive were sent out. We shall discuss first the results of 
the national questionnaire and then the results of the Ohio question- 
naire. The procedure followed will be that of presentation of the 
tabulated results according to question numbers. 


Part A. THE NATIONALLY CIRCULATED QUESTIONNAIRE OF 25
QUESTIONS 


Question 1.—Are intelligence or “College Entrance”’ tests being 
used by official authorization in your school as a part of the college 
administration routine, 1923-1924? 





Answer “Yes” or “No.”


Question 2.—After official authorization of such routinely administered
tests, what was the date of their first use in your college?





First date of use after authorization. 


The tabulated answers to these two questions will be found in
Table I. The “x”s of the table indicate the years in which tests were
given.

TABLE I

Name of college; years that tests have been used (columns: 1918 or
earlier, 1919-1920, 1920-1921, 1921-1922, 1922-1923, 1923-1924).
[The x marks showing the years of use are not legible in this copy.]

Alabama Polytechnic Institute
University of Arizona
[illegible]
Boston University
Brown University
Bryn Mawr College
Carnegie Institute of Technology
[illegible]
Cleveland School of Education
Colgate University
State Teachers College of Colorado
Columbia University
Connecticut Agricultural College
Cornell University
Dartmouth College
[illegible]
George Peabody College for Teachers
Georgia State College of Agriculture
Goucher College
Grinnell College
[illegible]
[illegible]
University of [illegible]
Indiana University
University of Iowa
Iowa State College
Johns Hopkins University
University of Kansas
Kansas State Agricultural College
University of [illegible]
Lehigh University
University of Maine
[illegible]
University of Minnesota
Montana State College
Mt. Holyoke College
University of Nebraska
University of Nevada
Newcomb College
University of New Hampshire
State University of New Mexico
University of North Carolina
University of North Dakota
Northwestern University
University of Pennsylvania
Pennsylvania State College
University of Pittsburgh
Rhode Island College of Education
Rutgers College
[illegible]
Stanford University
Swarthmore College
[illegible]
University of Texas
University of Utah
[illegible]
University of Vermont
University of Washington
Wellesley College
West Virginia University
Woman's College of Alabama
University of Wyoming
[illegible]

Part B. THOSE COLLEGES USING TESTS EXPERIMENTALLY, OR
TEMPORARILY, OR IN A PART OF UNDERGRADUATE COLLEGES OF THE
UNIVERSITY ONLY, 1923-1924. THIRTEEN COLLEGES

Amherst College
University of Chicago
Colorado Agricultural College
University of Colorado
George Washington University
Harvard University
University of Illinois
University of Michigan
University of Missouri
New York University
Princeton University
University of South Dakota
Southern Methodist University

Part C. THOSE COLLEGES NOT NOW USING INTELLIGENCE TESTS,1
1923-1924. THIRTY-ONE COLLEGES

Bowdoin College
University of California
College of City of New York
University of Delaware
University of Denver
De Pauw University
University of Georgia
Gustavus Adolphus College
Louisiana State University
Lebanon Valley College
Massachusetts Agricultural College
Michigan Agricultural College
Mississippi Agricultural and Mechanical College
University of Montana
Oklahoma Agricultural and Mechanical College
University of Oregon
Oregon Agricultural College
Purdue University
Randolph Macon Woman’s College
Reed College
University of Rochester
University of South Carolina
University of Tennessee
Tulane University
Tufts College
University of Virginia
Virginia Polytechnic Institute
Wesleyan University
Western State Normal School (Michigan)
Williams College
University of Wisconsin

1 Some of these colleges give tests to students of certain departments, but not
to the entire student body.


The summary of the table indicates that in 1923-1924 there were
66, or exactly 60 per cent, of these 110 colleges officially using tests.
The growth of the use of tests in these colleges since 1918-1919 has
been very steady and consistent. During the academic year of 1918-1919,
only 8 of these 66 colleges gave University Intelligence Tests;
some of these had begun the tests previous to that year. The largest
increase in the number, from 8 to 30, comes in the academic year 1919-1920,
the year following the War, during which year many colleges
gave the Army Alpha Test. The Thorndike Tests also became available
during this year. In 1920-1921, there were 39 of the colleges
giving the tests; in 1921-1922 there were 48; in 1922-1923 there
were 59; and in 1923-1924 there were 66. In addition, 13 colleges were
using tests experimentally in an attempt to evaluate their worth for
possible later official adoption. The replies indicate that in all probability
several of the colleges listed under the second or “experimental”
group at the present time will have officially adopted intelligence tests
in 1924-1925, as will some of those colleges now listed among the 31
colleges not using tests in 1923-1924.

Fig. 1. Relation of size of college to adoption and use of intelligence tests, 1923-1924.
Ninety per cent of the colleges not giving tests in 1923-1924 have a total enrollment
of less than 4000 students; 50 per cent of colleges of the experimental group have a
total enrollment of more than 4000 students. The colleges not giving tests are preponderantly
small colleges; the colleges of the experimental group are preponderantly large
colleges. The size of 7 of the 110 colleges could not be determined.

A distribution of the above three groups on the basis of number of 
enrolled students of the colleges in 1921-1922,1 gives the curves of Fig. 1.

Those colleges not giving the tests are preponderantly smaller 
colleges. The difficulty of securing adequate personnel for administering the
tests is a factor which may have prevented earlier adoption of the
tests. To meet these needs of the smaller colleges, the larger universities
will probably shortly be giving professional courses in the
administration and uses of tests in higher education. 

The colleges giving the tests on an experimental basis are prepon- 
derantly larger colleges. We may guess that the difficulty of making 
intelligence tests function in actual daily use by the faculty and ad- 
ministrative officers of large universities, has been one very impor- 
tant factor in explaining this situation. We sadly need techniques 
for getting the test scores into actual use on a large scale.

Question 3.—If the college or individuals have published any bul- 
letin or articles bearing on the uses and results of the tests we would 
appreciate receiving copies for our files; or in the case of books and
available publications, complete bibliographical references. 

The results of the tabulation of this question show that of 66 col- 
leges which gave the tests in 1923-1924, approximately two-thirds, 
45, have issued no publications. Obviously greater publicity of results 
on the part of those making use of the tests would result to the advan- 
tage of all as contributing basic data for the improvement of the tests. 
The 56 publications of the 25 colleges writing them are, of course, but a 
small part of the approximately 400 publications which are now available
in a bibliography on the subject.2

Question 4.—What tests are being used, academic year of 1923— 
1924? (If not commercially available, we would be pleased to have 
you submit samples.)

Question 23.—What correlation coefficient is most characteristic 
of the relationship of your intelligence test to total freshman scholar- 
ship among unselected freshmen? 

1 U.S. Bureau of Education, Statistics of Universities, Colleges and Professional
Schools, 1921-22, Bulletin No. 20, Washington, Government Printing Office, 1924,
161 pp. The enrollment of seven colleges could not be determined.

2 The most complete bibliography published to date is given in McPhail, A.:
“Intelligence of College Students.” Warwick and York. Baltimore. 1924, 176
pp. (295 titles in bibliography). The writer has collected a bibliography of 400
titles.

Inasmuch as certain tests are used because of certain real or
fancied advantages, chiefly as to validity in predicting college marks,
these two questions were considered together. In addition, the total
administrative time of each test was determined from the literature
and other sources and the tests divided into six groups, according to
total administrative time.

TABLE II. FREQUENCY OF USE OF CERTAIN TESTS IN 66 COLLEGES IN RELATION
TO ADMINISTRATIVE TIME (WRITING TIME) REQUIRED AND TO CORRELATIONS
BETWEEN INTELLIGENCE AND FRESHMAN SCHOLARSHIP


























Validity coefficients; number of times reported: 
No. of 
Actual writing time and test used : | E | total | ‘test 
ctual writing time Not given 5 ini 5 = 2 2 ~ ‘0 - 
| 1 
g 32 | 3 [gis 
6. 150 minutes or over 
EE cescekenscuttescasacicees 5+4%+34 1 ly 3) 36\2+%)..| 3 |154K%a) 19 
Ph IE, vc ccukeeeatedabesaedos 
EE cnsctsnnttekéeniedstenlesercenseecece i ee oS | 1 veut ae 3 
4. 90-119 minutes 
Rhode Island College of Education.........|.........sceeelesccclecsceeecs a ee 1 1 1 
University of Minnesota Test. ............)...ceccecccecclecscclecceecess itches dceuenens 1 1 
3. 60-89 minutes 
CS EE eh Sen eee ys ee St 1 1 
Iowa Content Examination............... iia: oot: aaa. 4) ..| 1% 3 
Brown University Test................4+- 4g %| 14+% ss caleéuckios ..| 2% 4 
2. 30-59 minutes 
Eee eee (a en: es Ae Pes aes | | Pet Pee Cae 1 1 
University g2 a a Re, Stet es FR ee eo hae 1 1 
Th Mshcspsocaccsesecdessouses 4+4+144+%| 3 2 ae Re 1] 7%] 10 
ay far mal a PERRI ALE ETS, PEE a, EE TERE. | ssedoeggele. % : 
eed ciei nnn pogendesiaokes 2+39t36+34) % 114364345) 1) 11.....]..]--- 7 11 
Ohio State University Test................ a ° Bigeenbeeeusesee ae GT: OM aR 1 1 
EEE Tene, AFL I | i eR RR ey 4 1 
Massachusetts Institute of Technology.....|.............. B Bvetionce ath alesachbetesieGhens 1 
ci aibinsecersiccceuenssecsss ER Se PRE: hs 4 2 
tn hoe) ccnatabsebnehuecthisatescicetbclseeeuneessancosts 1 ek 1 
1. Under 30 minutes 
University of Washington Tests. ..........)....ccccccccnclecceclecceerccels + Bee ee j 1 1 
Terman i incméssshdeuncruakes 1 | 14% | 2...) 1 556 7 
Stone ess a M cadinhnawbabehkbeles ches 6c0nesencspenaanenndeale Re | pare a ee | 1 
Scott Company Test...............se00-. S 8 «- "- Riceeiiedtonesaneieatnastsueastesioas 1 
Rand I (lows State College) VRE EG  . ‘Bisdaghbveddeseens héschvcovilednbn ea 1 
oore 00d 04 on0avnderessetenl soccecceccccscheseeeibensessce «| EI. -0ee ees 1 
— Ui cenckdbédededdebbutdhdectescsaten ij osaabebae ee RE! oo : 
GUANA LATAMMAL.... 0... cece ccc cceeccecl|eeereeesescens BQ Jocccccccclocioccfocccclecioce 4 
Illinois General Intelligence...............)...20--eeeeeee ME Nocasconcalevicbaletasdhacnete 1 
i iMicwnccsceucegutetas ie ee seer eS KEP 1 
C. B. A. Mental Alertness................ aes a Sa & cubeb dente bdes 1 
I Ebncarevccesescsccsnccssesees 3+4+4+%/14+% 4g aig Ut ecee dhauean 6%2) 10 
Rass Ls chabecgeesedéadaenees S  Reaeeeenerens ap, Pie Ss a Pree 1 1 
6. 150 minutes or over..............000e000- 534 1 3) 4) 2% |..| 3 |15'Ka| 19 
ink cvs cu hvs ckcubeher ness tbhhesenend Oe PS -| 3g) 1 ...| 2% 3 
¢ oo eae oadasecesecéadeenaseunewedee gtveooaing iz’ vesgaaee! Niel ig’ |": 1 Gs : 
60-89 minutes...............cc cece eeees ca ae Pe 
5 ERR RE ER ep ee 7% 4 4hy 1; 2 | 1% }.. % 21143) 31 
1. Under 30 minutes................000000+5 73% 2% 1% wet w te. 18% | 27 
akin cncnnencd ciebnanen ee oe Pere oS) TES MA : 1 1 
Total number of colleges................++.- 23 a) 5 7,6) 7 |..;6 | 66 91 



































(Note: If a given college uses two tests, each of the two tests is given one-half of a frequency in the table; if
three tests, each test receives one-third of a frequency, and so on.)

The results are shown in Table II. In this table, a college using
one certain test only and getting a certain validity is indicated in the
table by the integral number, 1; two colleges using a certain test only,
by the integral number 2, etc. Thus, one college using the University
of Washington Test only secures a validity somewhere between .52-.55;
three different colleges each using the Thorndike Test secure validities
of .60 or over. Those colleges which use two or more tests or scales
have the unit frequency evenly distributed in the column to the two
or more tests used. Thus, examination of the .40-.43 column shows
that a certain college secures a validity coefficient between .40-.43
from a three-test battery of Army Alpha, Terman Group and Otis Test.
The last column on the right hand side of the table shows the number
of different times each test is used. The Thorndike Test is used most
frequently, by 19 colleges; the Otis next, by 11 colleges; the Thurstone
and Army Alpha next, each by 10 colleges; and Terman Group by 7.
This indicates a marked falling off in the use of Army Alpha as reported
by Whipple (’22). It is apparent that many of the most valid tests are
used only by the colleges which originated them. The reliance on
commercialized tests, known in certain cases to be much too easy
for students of college level, is also marked.1 Colleges making use
of such “too easy” tests frequently hope to boost their validity predictions
by the use of two or more such tests. Thus, 66 colleges use in all
91 tests (not different tests), or an average of 1.38 tests each. This
attempt has apparently been largely disappointing, save in those
cases where at least one of the two or more tests used has itself had
high validity. There are 30 different tests used by the 66 colleges.
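
The tallying rule of Table II (each college contributes one unit of frequency, split evenly among the tests it uses) can be sketched in a few lines of Python. The function name and the three-college list are hypothetical; only the three-test battery of Army Alpha, Terman Group and Otis is taken from the text.

```python
from collections import defaultdict
from fractions import Fraction


def tally(colleges):
    """Split each college's unit frequency evenly over the tests it uses.

    Returns (fractional frequency per test, number of colleges using each test).
    """
    freq = defaultdict(Fraction)
    uses = defaultdict(int)
    for tests in colleges:
        share = Fraction(1, len(tests))
        for test in tests:
            freq[test] += share
            uses[test] += 1
    return dict(freq), dict(uses)


# Hypothetical example: three colleges, six test uses in all.
colleges = [
    ["Thorndike"],
    ["Army Alpha", "Terman Group", "Otis"],
    ["Otis", "Thurstone"],
]
freq, uses = tally(colleges)
for test in sorted(freq):
    print(f"{test:14s} frequency {freq[test]}  used by {uses[test]}")

# The average reported in the text: 91 test uses over 66 colleges.
print(round(91 / 66, 2))
```

The fractional frequencies always sum to the number of colleges, while the counts in the "times used" column sum to the total number of test uses; this is why the table can show 66 colleges and 91 tests side by side.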

In Table II we note that 23 of the 66 colleges using tests by official 
authorization have not determined the predictive value of the tests used. 

Obviously this table indicates a large advantage in point of high 
validity coefficients in favor of the longer-timed tests as compared with 
the shorter-timed tests. The shorter-timed tests would appear to worse 
advantage if they were given alone, rather than frequently in combination 
with other tests as indicated in Table II. 

The median validity coefficient of 43 colleges which report their 
validity coefficients is .46. The distribution of validity coefficients, 
taken from Table II is given in Table III. 


1 This has also been noted by Laird and Andrew (’23). Laird, D. A. and 
Andrew, A.: The Status of Mental Testing in Colleges and Universities in the 
United States. School and Society, Vol. XVIII, No. 464, Nov. 17, 1923, pp. 
594-600. 
TABLE III.—DISTRIBUTION OF VALIDITY COEFFICIENTS OBTAINED IN 43 COLLEGES 


VALIDITY COEFFICIENTS              NUMBER OF COLLEGES 
[interval illegible]                        9 
[interval illegible]                        8 
[interval illegible]                        7 
[interval illegible]                        6 
[interval illegible]                        7 
[interval illegible]                        0 
[interval illegible]                        6 
Total                                      43 


To one conversant with the magnitude of the standard error of 
estimate of small validity coefficients, it would appear that few of 
these colleges have tests long enough in time limits, or of the proper 
character, such as to yield minimally acceptable validity coefficients. 
Such validities will, however, in many cases compare favorably with 
alternative methods of old type entrance examinations. A great 
need for improvement is evident. As indicated later, one big improve- 
ment will come by lengthening the time limits of the total examination, 
and increasing the number of questions or test items accordingly. 

An examination of the tests in the list of tests of Table II shows 
that undoubtedly some of the tests are too easy. 

Whipple¹ showed in 1922 that the Army Alpha is decidedly too 
easy for college students, and he cites the greater difficulty of the 
Thorndike Test as one of its advantages. 

Another obvious improvement would come from a standardization 
of the period of time over which college marks should be collected for 
computation of a validity coefficient. 

Question 5.—For which of the following purposes (include any 
others) are the ratings used? 

Check with XXX your most important use of the tests. 

Check with XX all other important uses made of the tests. 

Check with X all other uses made of the tests but which can 
scarcely be called “important.” 

(Here followed a checking list of the 15 uses tabulated in Table 
IV and blank lines for additional uses.) 

Six colleges of the 66 using tests failed to answer this question. 
The remaining 60 each checked the one or more uses it makes of the 
tests, with an aggregate of 341 uses, or an average of 5.7 of the 15 
different uses per college. This indicates that colleges do not look 
upon the intelligence test as a mere entrance test or as an idle curiosity, 
but are beginning seriously to use the tests in many administrative 
situations. In addition, many colleges supplied one or more additional 
uses not included in Table IV, which if included would raise the 
average number of uses to approximately 7 per college. 


1 Whipple, G. M.: Intelligence Tests in Colleges and Universities. Twenty- 
first Yearbook of National Society for Study of Education, 1922, Part 1, pp. 253-270. 

The distribution of uses, according to the 15 classes of uses specified 
in the questionnaire are given in Table IV. 


TABLE IV.—DISTRIBUTION OF 15 TABULATED USES OF INTELLIGENCE TESTS, 
BY NUMBER OF COLLEGES SO USING THEM, AND ACCORDING TO 
IMPORTANCE OF THE USE SPECIFIED 


                                                           Number of colleges, 
Use made of test                                           by importance checked   Total 

As the sole basis for admission                               0      0      0         0 
As a partial basis for admission                              5      7      7        19 
In determining dismissal for low scholarship                 15     23     11        49 
In determining probation for low scholarship                  6     19      9        34 
In dealing with disciplinary (deportment) cases              20      8      2        30 
In determining amount of school work to carry                18     14      4        36 
In determining amount of work for self-support               14      4      0        18 
Encouraging bright students to undertake graduate work       11     10      4        25 
Encouraging extra effort in case of unmotivated bright 
  students                                                   14     18     10        42 
[use illegible]                                              10      3      0        13 
Making recommendations for scholarships                      10     10      3        23 
Making recommendations for fellowships                        7      6      2        15 
In hiring student clerical help                              10      2      1        13 
In determining membership in honorary scholastic 
  societies                                                   5      1      0         6 
In sectioning students in the department 
  according to capacity for progress                          3     10      5        18 

Total                                                       148    135     58       341 


The last column on the right hand side of the table gives the total 
number of colleges making a given use of tests. The most frequent 
use is “in determining dismissal for low scholarship;” the second most 
frequent, “encouraging extra effort in case of unmotivated bright 
students;” the third most frequent use, “in determining amount of 
school work to carry;” the fourth most frequent use, “in determining 
probation for low scholarship;” the fifth most frequent use, 
“in dealing with disciplinary (deportment) cases;” the sixth most 
frequent use, “encouraging bright students to undertake graduate 
work;” the seventh most frequent use, “making recommendations for 
scholarships;” the eighth, “as a partial basis for admission.” It is 
interesting to note that no college uses the tests as the sole basis for 
admission and only 19 use them as a partial basis for admission. 
College intelligence tests are not entrance tests. They are primarily 
educational administrative devices for dealing with administrative 
and pedagogical problems of students, rather than the criteria of 
intelligence of applicants for admission. 
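
The totals and the order of frequency just described can be checked by a short tally. The sketch below is illustrative only: the counts are transcribed from Table IV in the order printed, the short labels are paraphrases introduced here, and one use whose label cannot be read is carried under a placeholder name.

```python
# Counts transcribed from Table IV: the three importance columns, in the
# order printed, for each tabulated use. Labels are shortened paraphrases.
TABLE_IV = {
    "sole basis for admission": (0, 0, 0),
    "partial basis for admission": (5, 7, 7),
    "dismissal for low scholarship": (15, 23, 11),
    "probation for low scholarship": (6, 19, 9),
    "disciplinary cases": (20, 8, 2),
    "amount of school work to carry": (18, 14, 4),
    "amount of work for self-support": (14, 4, 0),
    "encouraging graduate work": (11, 10, 4),
    "extra effort by unmotivated bright students": (14, 18, 10),
    "use with unreadable label": (10, 3, 0),  # placeholder name
    "recommendations for scholarships": (10, 10, 3),
    "recommendations for fellowships": (7, 6, 2),
    "hiring student clerical help": (10, 2, 1),
    "honorary scholastic societies": (5, 1, 0),
    "sectioning students": (3, 10, 5),
}
RESPONDING_COLLEGES = 60  # the 66 colleges less the 6 that did not answer

# Total number of colleges reporting each use (last column of the table).
row_totals = {use: sum(counts) for use, counts in TABLE_IV.items()}
# Totals of the three importance columns (bottom row of the table).
column_totals = tuple(sum(column) for column in zip(*TABLE_IV.values()))
grand_total = sum(row_totals.values())
mean_uses = grand_total / RESPONDING_COLLEGES

# Uses ordered from most to least frequently reported.
ranked = sorted(row_totals, key=row_totals.get, reverse=True)

print(column_totals, grand_total, round(mean_uses, 1))
for use in ranked[:8]:
    print(f"{row_totals[use]:3d}  {use}")
```

The first eight entries printed correspond to the eight uses named in the text, from dismissal for low scholarship down to partial basis for admission.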

Consequently a test which is merely scaled so as to yield good 
differentiation of the to-be college failures only is not adequate for 
general college use. The multitude of other uses for the tests made by 
the 19 colleges which use the tests as a partial basis for admission is 
shown by the following summary: Three colleges use the tests for ten 
or more purposes each; one college uses them for nine purposes; one 
for eight; two for five; four for four; two for three; two for one purpose, 
while only three colleges use tests for admission only and for no other 
purpose. 

Obviously high correlation with college success of the test at all 
parts of the scale is required in a test adapted to the functional 
demands now put upon it by college authorities. On this basis, the 
Army Alpha, and probably a host of short-timed commercialized 
tests which perhaps differentiate fairly well between ‘“‘to-be college 
successes’”’ and “‘to-be college failures,’’ are but poorly suited to general 
college needs. 

The extent to which sectioning of students according to ability to 
make progress in academic work has progressed is shown below. 
Twenty-two colleges of the 79 using tests officially or experimentally 
conduct sectioning in 31 departments, or slightly less than 1½ 
departments per college attempting sectioning. It is interesting to note 
that English is the department most frequently making use of 
sectioning, with nine colleges doing sectioning in this department. 
Psychology comes next with four; then mathematics with three; and Romance 
languages and orientation courses with two each. Apparently the 
courses of the above departments are mostly prescribed courses and ones 
which meet the desirable requirement of having a very large number of 
students taking the course and reciting at the same hour. This recalls 
an administrative difficulty of scheduling which has not been met in 
most places where a comparative freedom of electives prevails and 
which must be satisfactorily solved before sectioning can be done at 
all. The comparatively small use made of sectioning does not 
necessarily mean that sectioning in colleges is not feasible or has failed. 
The figures probably can be correctly interpreted to mean that the 
administrative difficulties of scheduling have not been satisfactorily 
met by most colleges. Some colleges meet the problem by having, 
preferably, at least three sections at the same hour. The student then 
elects the hour and not the instructor or section; the intelligence test 
scores being available at the opening of the semester, the pupils are 
assigned on the basis of these scores, and the list is posted 
before the first meeting of the semester. This plan requires the 
availability of test scores at the opening of the semester, which in turn 
requires the administration of the test before the opening of the 
semester and sufficiently long in advance of its opening to allow for 
the scoring of the papers and derivation of each pupil’s score. As 
shown later by the results of tabulation of question 11, less than 10 
per cent of colleges give the tests sufficiently in advance of the opening 
day of school to allow for scoring of the papers. The inevitable 
result, in the case of all large colleges, is that the scores are not 
available in time to be used for sectioning entering freshmen. 

Sectioning can then only be done in the case of second-semester 
freshmen (after the time for its greatest need has passed) or in the case 
of upper-classmen, of whom there are often too few to section, or of 
whom it may be said, “they have already proven by their persistence 
an ability to survive the system and to get an education in spite of 
our arduous earlier attempts to eliminate them.” 

Additional uses of the tests given by colleges volunteering uses not 
given in Table IV are as follows: 

1. In studying the problem of teachers’ marks in departments in 
which the marks are much above the median and intelligence test 
scores below the median. 

2. In notifying instructors where the scholarship is not up to 
expectation from the tests. 

3. Admission of rehabilitation men who had not completed com- 
mon school work. 

4. For research purposes of graduate students, or in classes in 
“mental tests.” 

5. For establishing “control” sections in research work. 

6. As a basis for individual consultation for improvement of study 
habits. 

7. For appraisal of value of transfer credits in order to save time 
of the executives of the scholarship committee. 

8. Experimentally to see whether they may be used as a satisfactory 
basis of recommendations for admission to “honors” courses. 

9. For advising election of major subject. 

10. For appointment of students to non-academic offices. 

Question 6.—To what extent does entrance to the college or 
university depend upon the scores made in the tests? 

Entirely? 

Partially? 

Not at all? 

This question duplicates, and serves as a check on, uses a and b of 
question 5. As stated above, it resulted in the conclusions that no college 
uses tests as the sole means of admission; that only 19 colleges (29 
per cent) of the 66 colleges using tests use them as a partial basis of 
admission; and that the remaining 47 colleges (71 per cent) make no use 
of tests for admission purposes. 

Question 7.—If you answered “entirely” or “partially” to question 
6, please state specifically the basis for your answer (i.e., in determining 
entrance, what variables are taken into account, and how they 
are weighted?). 

An inspection of the answers of the 19 colleges which use tests for 
admission purposes indicates that no fixed relative weights to be 
attached to intelligence, scholarship, and the like are in use; instead 
the decision is a matter of subjective judgment of the person or persons 
responsible for the quality of entering students. 

Perhaps the most progressive use of tests in this connection is made 
by one college which interviews all low scoring students and advises 
them not to enter, presumably basing its advice on known mortality 
of freshmen of each percentile score. It would seem that this method 
should be very effective. 

Candidates from non-accredited or non-standard schools are 
admitted on the basis of the tests in several colleges. Several also 
require the test to be taken by students of the lowest fifth, quarter, 
third or half of high school scholarship. Special, or over-age, students 
deficient in secondary school preparation are admitted on the basis of 
test scores in several colleges. Such students include “special” 
students, Veterans’ Bureau students, extension students and the like. 
One college admits on probation only those students not making 
60 points on the Thorndike Test. Another college requires the test 
in case the entrance committee is in doubt regarding the qualifications 
of a candidate for admission. Others require, in addition to scholastic 
records and intelligence, such additional qualifications as character 
references, and a satisfactory composition to test the student’s English 
ability. These are of course subjectively judged. 

Question 8.—By what bureau or department or college official are 
the tests administered? 

A tabulation of the answers to this question yields the following 
results: 

In 36 colleges the tests are administered by the Psychology Depart- 
ment; in 15 by the Department of Education; in 4 by the Entrance or 
Admissions Board; in 4 by the Dean; in 3 by the Personnel or Research 
Bureau; in 2 by the Registrar; and in 2 by a faculty committee. In 
view of the fact that the administration of intelligence tests is com- 
monly regarded as a technical matter, it is not surprising that about 
three-fourths of the tests given are administered by Departments of 
Psychology or of Education. 

Question 9.—What is the total administrative time in minutes of 
all the tests, inclusive of distribution of papers, reading directions, 
collection of papers, and dismissal of students? 

Question 10.—How many minutes of the above time are the students 
employed in actually writing or reading tests and directions unaided 
by the examiner? 

The difference between the total administrative time and actual 
writing time may be called waste time, inasmuch as it adds nothing 
to the validity of the tests and might, on the other hand, often be 
reduced in amount to the obvious advantage of securing more test 
time. The figures indicate that the short tests are the greatest 
offenders; tests under one-half hour in length frequently waste as 
much time in administration as is spent on actual testing. The 
average figures show that for tests under 30 minutes in length we may 
expect a waste of 22 minutes on the average; for tests 30-59 minutes in 
length, 24 minutes; for tests 60-89 minutes in length, 26 minutes; for 
tests 90-119 minutes in length, 28 minutes; for tests 120-149 minutes 
in length, 32 minutes; for tests 150-179 minutes in length, 37 minutes; 
for tests over 3 hours long, 42 minutes. We scarcely need remark 
that any wasted time saved for testing may increase the validity 
coefficient thereby. 
We have shown in Table II that validity does increase with time. 
It is interesting to note that in the case of the seven colleges obtaining 
validity coefficients of .60 or above, none has less than an hour testing 
program, and only two have less than a two-hour program. In 
connection with a study of waste time in administering tests, the plan 
of having all tests (including practice tests where used) in one booklet 
merits serious consideration. 
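
The weight of waste time in the shorter tests can be made concrete with a rough calculation. The sketch below is only illustrative: the waste figures are those quoted above, but the representative testing time taken for each class (the class midpoint, and 180 minutes for the open-ended top class) is an assumption introduced here, as is the reading of "length" as actual testing time.

```python
# (class label, assumed representative testing minutes, reported waste minutes)
# The middle value of each tuple is an assumption, not a figure from the survey.
WASTE_BY_CLASS = [
    ("under 30", 15, 22),
    ("30-59", 45, 24),
    ("60-89", 75, 26),
    ("90-119", 105, 28),
    ("120-149", 135, 32),
    ("150-179", 165, 37),
    ("over 3 hours", 180, 42),
]


def waste_share(test_minutes, waste_minutes):
    """Fraction of total administrative time that is not testing time."""
    return waste_minutes / (test_minutes + waste_minutes)


shares = {label: waste_share(t, w) for label, t, w in WASTE_BY_CLASS}

for label, share in shares.items():
    print(f"{label:>12}: {share:.0%} of administrative time is waste")
```

Under these assumptions more than half of the administrative time of the shortest tests is waste, against less than one-fifth for tests of two and a half to three hours.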


(Concluded in February issue) 











THE SPELLING CONSCIOUSNESS OF COLLEGE 
STUDENTS 


GORDON HENDRICKSON AND L. A. PECHSTEIN 
College of Education, University of Cincinnati 


In many life situations it is important to know whether one is 
performing his task correctly or incorrectly. More than conscience 
is needed to insure the satisfactory achievement of many tasks. In 
particular, spelling offers a field for the study of what may be called 
consciousness of correctness as distinguished from conscience, or the 
desire for correctness. Society demands not less than 100 per cent 
accuracy in spelling, yet few people are likely to learn to spell all the 
words they may, from time to time, add to their writing or speaking 
vocabulary. It is as desirable, then, to know when one should use 
the dictionary as it is to have a large vocabulary of words which can 
be spelled accurately without resort to a reference book, and as it is to 
be willing to consult one. Every stenographer knows this. 

A method for the study of spelling consciousness has been suggested 
by W. F. Tidyman.¹ Tidyman reports an experiment in which 100 
elementary school pupils spelled approximately 100 words each, 
marking their spellings in such a way as to indicate a judgment as to 
the correctness of their spellings. His general conclusion was that 
the children knew when they spelled words correctly, but were fre- 
quently unaware of their misspellings. Another study in this general 
field has been made by W. C. Trow, whose recent monograph on 
“The Psychology of Confidence”² indicates that for college students 
degree of confidence is a reasonably good indication of correct judg- 
ments but that this varies greatly for different situations. The present 
study is concerned with an application of Tidyman’s method, in a 
modified form, to the study of the spelling consciousness of college 
students. 

1 Tidyman, W. F.: The Teaching of Spelling, 1919; 91-96; also Do Elementary 
School Pupils Know When They Make Mistakes in Spelling? School and Society, 
Vol. XX, 1924, pp. 349-350. 

2 Archives of Psychology, 1923, No. 67. 

Sixty-seven sophomore women students in introductory psychology 
classes were tested in May for general spelling ability. The words 
used in this test will be found in Table I together with the percentage 
of misspelling for each word. These words were selected arbitrarily, 
the intent being to secure a considerable variety as regards spelling 
difficulty, and to use only words which were at least in the reading 
vocabulary of the students tested and hence potentially in their writing 
vocabulary. Extremely unusual words were purposely excluded, to 
keep the test from being in any respect an information or general 
vocabulary test. Certain words which do not appear in Table I 
were given but eliminated from consideration either because certain 
students claimed¹ never to have met them (e.g., exigency, synchronous), 
or because they were spelled correctly by all the students (e.g., desper- 
ate, freshman). Words for which the Standard Dictionary permits 
two spellings were not used. Only two of the words used (principal, 
relieve²) are found in the Ayres Spelling Scale, which probably approxi- 
mates the thousand words which it is most necessary to be able to 
spell, and which are most generally taught in school. Table I is in 
the main a sampling of that wider spelling vocabulary which the 
college student must acquire for himself, if at all. 


TABLE I.—LIST OF 50 WORDS USED IN SPELLING TEST, WITH PERCENTAGES OF 
MISSPELLING BY 67 COLLEGE STUDENTS 


[illegible]         [illegible]     [illegible]         20.9 
[illegible]         [illegible]     [illegible]         [illegible] 
[illegible]         [illegible]     [illegible]         19.4 
[illegible]         [illegible]     [illegible]         [illegible] 
[illegible]         [illegible]     [illegible]         17.9 
Embarrass           55.2            Accessible          16.4 
[illegible]         [illegible]     [illegible]         16.4 
[illegible]         44.8            Marriageable        16.4 
[illegible]         43.3            Principal           16.4 
[illegible]         [illegible]     [illegible]         14.9 
[illegible]         [illegible]     [illegible]         14.9 
[illegible]         [illegible]     [illegible]         [illegible] 
[illegible]         37.3            Achievement         13.4 
[illegible]         [illegible]     [illegible]         [illegible] 
Abhorrence          34.3            Miscellaneous       11.9 
[illegible]         [illegible]     [illegible]         [illegible] 
[illegible]         34.3            Accidentally        10.4 
[illegible]         [illegible]     [illegible]          7.5 
Incorrigibility     31.3            Ninth                7.5 
[illegible]         [illegible]     [illegible]          7.5 
Mortgage            31.3            Laboratory           6.0 
[illegible]         [illegible]     [illegible]          6.0 
[illegible]         [illegible]     [illegible]          4.5 
Sandwich            [illegible]     [illegible]          4.5 
[illegible]         20.9            Proceed              4.5 





1 Later, when the test results were reported to the students. 
2 “Relief” in the Ayres Scale. 

The test procedure was carefully standardized. Each word was 
pronounced, an illustrative sentence was given, and the word was 
repeated. No suggestion that a judgment on the spelling would be 
called for was made until all of the test words had been dictated. 
The subjects were then asked to read their papers carefully and indicate 
after each spelling whether they were sure that the word was spelled 
correctly, or were sure that the word was spelled incorrectly, or were 
doubtful about the accuracy of their spelling of the word. The letters 
A, B, and C were used for this purpose, the letters with their 
significance being written on the blackboard. 


TABLE II.—PERCENTAGES AND NUMBERS OF CORRECT AND INCORRECT SPELLINGS, 
AND OF JUDGMENTS “CORRECT,” “INCORRECT,” AND “DOUBTFUL;” 
TOTAL NUMBER OF SPELLINGS TAKEN AS 100 PER CENT 


                                                            Number of 
                                             Percentages    spellings 

Total number of spellings                       100.0         3350 
  Spelled correctly                              72.8         2438 
  Spelled incorrectly                            27.2          912 
  Judged “correct”                               75.8         2539 
  Judged “incorrect”                             21.4          718 
  Judged “doubtful”                               2.8           93 
Judged “correct;” 
  spelled correctly                              61.6         2065 
  spelled incorrectly                            14.1          474 
Judged “incorrect;” 
  spelled correctly                              10.6          354 
  spelled incorrectly                            10.9          364 
Judged “doubtful;” 
  spelled correctly                                .6           19 
  spelled incorrectly                             2.2           74 
Accurate judgments                               72.5         2429 
Inaccurate judgments                             24.7          828 
Doubtful judgments                                2.8           93 
Measure of spelling consciousness                47.8         1601 











Tables II, III, and IV present the tabulated data concerning the 
judgments passed and their relation to accuracy. Table II indicates 
that the percentage of correct spellings is 72.8; that 75.8 per cent of all the 
spellings were judged to be “correct,” 21.4 per cent to be “incorrect,” 
and 2.8 per cent to be “doubtful.” There are obviously six 
possibilities for any one spelling of any one word. Any one of the three 
possible judgments may be passed upon it, and it may be spelled either 
correctly or incorrectly. The number of spellings and the percentages 
of the total spellings falling in each of these classes are indicated in the 
next part of Table II. There are two possibilities for entirely accurate 
judgments, judging words spelled correctly “correct” and judging 
words spelled incorrectly “incorrect,” and similarly two possibilities for 
entirely inaccurate judgments. The percentage of accurate judgments 
is 72.5 and the percentage of inaccurate judgments 24.7, the remaining 
judgments being “doubtful.” The final figures in Table II will be 
interpreted later. 
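
The whole of Table II follows from the six cell counts, and the derivation can be sketched briefly. The code below is an illustration only; the counts are those of Table II, while the dictionary layout and the names used are introduced here.

```python
# Six cell counts from Table II, keyed by (judgment passed, actual spelling).
CELLS = {
    ("correct", "right"): 2065,
    ("correct", "wrong"): 474,
    ("incorrect", "right"): 354,
    ("incorrect", "wrong"): 364,
    ("doubtful", "right"): 19,
    ("doubtful", "wrong"): 74,
}
total = sum(CELLS.values())  # 67 subjects x 50 words


def pct(n):
    """Percentage of all spellings, to one decimal place."""
    return round(100 * n / total, 1)


# Marginal totals by actual spelling and by judgment passed.
spelled_right = sum(n for (_, actual), n in CELLS.items() if actual == "right")
spelled_wrong = total - spelled_right
judged = {
    judgment: sum(n for (j, _), n in CELLS.items() if j == judgment)
    for judgment in ("correct", "incorrect", "doubtful")
}

# Accurate: judgment agrees with the spelling. Inaccurate: it contradicts it.
accurate = CELLS[("correct", "right")] + CELLS[("incorrect", "wrong")]
inaccurate = CELLS[("correct", "wrong")] + CELLS[("incorrect", "right")]

print(total, pct(spelled_right), pct(spelled_wrong))
print({judgment: pct(n) for judgment, n in judged.items()})
print(pct(accurate), pct(inaccurate))
```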


TABLE III.—PERCENTAGES AND NUMBERS OF JUDGMENTS “CORRECT,” “INCORRECT,” 
AND “DOUBTFUL,” PASSED UPON THE CORRECTLY SPELLED WORDS 
AND THE INCORRECTLY SPELLED WORDS RESPECTIVELY 


                                                            Number of 
                                             Percentages    spellings 

Words spelled correctly                         100.0         2438 
  judged “correct”                               84.7         2065 
  judged “incorrect”                             14.5          354 
  judged “doubtful”                                .8           19 
Words spelled incorrectly                       100.0          912 
  judged “correct”                               52.0          474 
  judged “incorrect”                             39.9          364 
  judged “doubtful”                               8.2           74 





Table III analyzes the same data from a different viewpoint. It 
answers the questions “what percentage of the words spelled correctly 
received each judgment” and “what percentage of the words spelled 
incorrectly received each judgment.” Only a small number of the 
judgments passed were “doubtful”: 0.8 per cent of the judgments on 
words spelled correctly, and 8.2 per cent of the judgments on words 
misspelled. The subjects were generally confident of their ability to 
distinguish the words they could not spell from those they could. This 
confidence was far from justified, for although 84.7 per cent of the 
words spelled correctly were judged to be “correct,” 52.0 per cent of the 
words misspelled were also judged to be “correct.” Less than half 
(39.9 per cent) of the misspelled words were judged “incorrect.” 
This agrees with Tidyman’s findings noted above with reference to 
elementary school children, that they were frequently unaware of 
their misspellings, but also indicates that the college students were 
much more confident in their judgments than the children.¹ 

1 Tidyman’s data: of the words spelled correctly, 96.2 per cent were judged 
“correct;” 0.3 per cent “incorrect;” 3.5 per cent “doubtful;” of the words 
misspelled, 38.3 per cent were judged “correct;” 30.9 per cent “incorrect;” 30.8 per 
cent “doubtful.” His test was relatively easier than ours; the correct spellings 
by his school children were 83.3 per cent of the total spellings. 

Table IV answers the question, “what does the judgment passed 
indicate regarding the actual spelling of the word?” This table 
signifies that the judgment “correct” is a good indication (81.3 per 
cent) of correct spelling, that the judgment “incorrect” is completely 
unreliable as an indication of incorrectness (50.7 per cent), chance 
giving as accurate a judgment, and that the judgment “doubtful” is 
very much more indicative of incorrect spelling (79.6 per cent) than 
the judgment “incorrect.” College students spelling words which 
they admit are in their reading vocabulary not only do not know when 
they spell words incorrectly, but further give evidence that doubt is a 
stronger indication of incorrectness than alleged certainty of 
incorrectness. This latter fact is hard to explain. The writers are at a 
loss to account for it, and merely submit this as a suggestive piece of 
evidence on the psychology of doubt and confidence. 


TABLE IV.—PERCENTAGES AND NUMBERS OF CORRECT AND INCORRECT SPELLINGS 
OF WORDS ON WHICH THE JUDGMENTS “CORRECT,” “INCORRECT,” 
AND “DOUBTFUL,” RESPECTIVELY, WERE PASSED 


                                                            Number of 
                                             Percentages    spellings 

Words judged “correct”                          100.0         2539 
  spelled correctly                              81.3         2065 
  spelled incorrectly                            18.7          474 
Words judged “incorrect”                        100.0          718 
  spelled correctly                              49.3          354 
  spelled incorrectly                            50.7          364 
Words judged “doubtful”                         100.0           93 
  spelled correctly                              20.4           19 
  spelled incorrectly                            79.6           74 
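
Tables III and IV are two readings of the same six counts: the first takes each spelling outcome as 100 per cent, the second takes each judgment as 100 per cent. A minimal sketch of both readings is given below; the function names are introduced here for illustration, and a recomputed percentage may differ from a printed one in the last digit.

```python
# judgment passed -> (words spelled correctly, words spelled incorrectly)
COUNTS = {
    "correct": (2065, 474),
    "incorrect": (354, 364),
    "doubtful": (19, 74),
}


def by_spelling(counts):
    """Table III reading: share of each judgment within each spelling outcome."""
    right_total = sum(right for right, _ in counts.values())
    wrong_total = sum(wrong for _, wrong in counts.values())
    return {
        judgment: (100 * right / right_total, 100 * wrong / wrong_total)
        for judgment, (right, wrong) in counts.items()
    }


def by_judgment(counts):
    """Table IV reading: share of each spelling outcome within each judgment."""
    return {
        judgment: (100 * right / (right + wrong), 100 * wrong / (right + wrong))
        for judgment, (right, wrong) in counts.items()
    }


table_iii = by_spelling(COUNTS)
table_iv = by_judgment(COUNTS)

for judgment in COUNTS:
    right_pct, wrong_pct = table_iii[judgment]
    print(f"judged {judgment:>9}: {right_pct:5.1f} of correct, {wrong_pct:5.1f} of misspelled")
for judgment in COUNTS:
    right_pct, wrong_pct = table_iv[judgment]
    print(f"judged {judgment:>9}: {right_pct:5.1f} right, {wrong_pct:5.1f} wrong")
```

The second reading is the one that bears on the use of the dictionary, since the writer has only his own judgment before him and must infer the correctness of the spelling from it.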

All these facts point to a low level for the spelling consciousness of 
college students. It would be of value perhaps to have a single measure 
of spelling consciousness. A simple form of such a measure may be 
taken on the analogy of the usual method of scoring tests requiring 
alternative responses, e.g., true-false tests. The usual method of 
grading such tests is to subtract the number of inaccurate responses 
from the number of accurate responses. In this case the number of 
entirely inaccurate judgments may readily be subtracted from the 
number of entirely accurate judgments. The result, for the total 
spellings of the 67 subjects, is startling. Spelling consciousness 
appears on the average for these subjects in this test to be 47.8 per 
cent (Table II).² 
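
The measure just described can be written as a small function. This is a sketch under the reading that the difference is expressed as a percentage of all spellings attempted, "doubtful" judgments counting neither for nor against; the function name and the single-student figures are hypothetical and serve only as illustration.

```python
def spelling_consciousness(accurate, inaccurate, total):
    """Accurate minus inaccurate judgments, as a percentage of all spellings.

    Doubtful judgments enter only through `total`.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if accurate < 0 or inaccurate < 0 or accurate + inaccurate > total:
        raise ValueError("counts are inconsistent with total")
    return 100 * (accurate - inaccurate) / total


# Group figure from Table II: 2429 accurate and 828 inaccurate of 3350.
group = spelling_consciousness(2065 + 364, 474 + 354, 3350)

# Hypothetical student: 50 words, 40 accurate, 8 inaccurate, 2 doubtful.
one_student = spelling_consciousness(40, 8, 50)

print(round(group, 1), one_student)
```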

Large individual differences were found in both spelling ability and 
spelling consciousness, spelling consciousness being measured for each 
individual exactly as for the group, by subtracting the number of 
inaccurate judgments from the number of accurate judgments. In 
spelling ability the range was from 36 to 98 per cent for the test used, 
S. D. being 13.4 units of the percentage scale. In spelling 
consciousness the range was from 14 to 92 per cent, S. D. being 20.8 units of the 
percentage scale. The mean ability in spelling was 72.8 per cent; in 
spelling consciousness, 47.8 per cent. 

The data of this study were used to obtain the relationship between 
spelling ability and spelling consciousness, and also, when supple- 
mented by further available data, the relationships between these 
abilities and general intelligence as measured by Army Alpha, and 
between them and scholarship as measured by grades in psychology. 
Table V presents these correlations. Spelling ability and consciousness 
correlate with each other more closely than either one correlates 
with intelligence; while the correlations with academic standing are 
negligible. That the psychology marks are fairly adequate measures 
of achievement is shown by their correlation of .40 with intelligence. 





1 Tidyman’s data: the judgment “correct’’ indicated correct spelling in 92.7 
per cent of the cases; the judgment “incorrect”? indicated incorrect spelling in 
95.3 per cent of the cases; the judgment ‘‘doubtful” indicated incorrect spelling 
in 72.4 per cent of the cases. 

2 Tidyman’s data, treated in the same way, give an average spelling 
consciousness of 79.5 per cent for his school children in his somewhat easier test. 

TABLE V.—CORRELATION COEFFICIENTS, PRODUCT-MOMENT METHOD. SIXTY-SEVEN 
CASES, EXCEPT IN CORRELATIONS WITH ALPHA, WHERE N IS 36. P.E. 
OF EACH COEFFICIENT IS BENEATH IT IN UPPER RIGHT HALF OF TABLE, 
AND OMITTED IN OTHER HALF. ALL COEFFICIENTS POSITIVE 


                                                Spelling    Alpha      Introductory 
                                    Spelling    conscious-  intelli-   psychology 
                                    ability     ness        gence test grades 

Spelling ability                      ...         .68         .35        .14 
                                                  .04         .10        .08 

Spelling consciousness                .68         ...         .30        .11 
                                                              .10        .08 

Alpha intelligence test               .35         .30         ...        .40 
                                                                         .09 

Introductory psychology grades        .14         .11         .40        ... 
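
The table does not say how the probable errors were computed. The sketch below assumes the conventional formula for the probable error of a product-moment coefficient, P.E. = 0.6745 (1 - r^2) / sqrt(N), which is an assumption introduced here and not a statement of the writers' procedure; on that assumption the six printed values are recovered to two decimal places.

```python
import math


def probable_error(r, n):
    """Conventional probable error of r: 0.6745 * (1 - r**2) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.6745 * (1 - r * r) / math.sqrt(n)


# (pair of measures, r, N, printed P.E.) for the upper right half of Table V.
# N is 36 wherever Alpha is involved and 67 otherwise, as the caption states.
UPPER_HALF = [
    ("ability / consciousness", 0.68, 67, 0.04),
    ("ability / Alpha", 0.35, 36, 0.10),
    ("ability / grades", 0.14, 67, 0.08),
    ("consciousness / Alpha", 0.30, 36, 0.10),
    ("consciousness / grades", 0.11, 67, 0.08),
    ("Alpha / grades", 0.40, 36, 0.09),
]

computed = {pair: round(probable_error(r, n), 2) for pair, r, n, _ in UPPER_HALF}

for pair, r, n, printed in UPPER_HALF:
    print(f"{pair:<26} r={r:.2f} N={n:2d} P.E.={computed[pair]:.2f} (printed {printed:.2f})")
```

On this reckoning the correlations with psychology grades (.14 and .11) are less than twice their probable errors, which agrees with their description as negligible.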

















It is at least possible that spelling consciousness may be cultivated. 
The student who enters college is obliged to acquire a vocabulary 
considerably in excess of the vocabulary of the elementary school 
class in spelling. Whether he learns to spell the new words or not 
rests mainly with him, since college instruction in spelling is usually 
incidental. He is seldom, however, compelled to spell new and diffi- 
cult words without opportunity to consult an authority. It may be 
that what he really needs is to recognize his own limitations with 
regard to words infrequently used rather than to spend time in learning 
to spell a new and extensive list of seldom used words. He obviously 
does not learn to spell all the new words he uses, and this study shows 
that he does not know his own deficiencies well enough to lead him to 
use a dictionary as often as he should. College students have a false 
confidence that they can select the words they do not know how to 
spell. In actual fact, they miss about half of them (52 per cent 
according to Table III). 

S. A. Courtis points out in his recently published course of study 
in spelling (“Teaching Spelling by Plays and Games,” 1922) the 
necessity for developing a spelling conscience, i.e., a desire to spell 
correctly. He sets up as an objective “a realization of the attitude of 
society towards misspellings, so that the slightest doubt in regard to 
the correct spelling of a word will operate to make them (the pupils) 
look up the spelling in a dictionary.” This is good, but an educated 




































44 The Journal of Educational Psychology 


conscience is also needed. If doubt is not felt when it should be, the 
dictionary will not be used. 

How can a spelling consciousness be developed? The final answer 
to this question doubtless depends upon the working out of a complete 
psychology of spelling consciousness. This should answer such 
questions as: “On what imaginal factors does recognition of a spelling 
as correct or incorrect rest? How do children and adults detect the 
slight differences between correct and incorrect spellings?” In the 
meantime, only tentative answers can be given. In part, spelling 
consciousness is developed incidentally to general spelling ability. The 
better spellers tend to have a higher degree of spelling consciousness 
(r is .68). There are many exceptions, however; .68 is only a moder- 
ately high correlation. Perhaps spelling consciousness would be 
increased by emphasis on dictionary work. It would be interesting, 
and not impracticable, to study the effect of various courses of study 
(e.g., the Courtis course referred to, with its strong emphasis on motiva- 
tion) on spelling consciousness. Teachers of spelling, once they 
see the need, will probably invent devices for cultivating the attitude 
of doubt with reference to the spelling of new words in the pupil’s 
vocabulary.¹ 

This study has presented a method for measuring spelling con- 
sciousness. The spelling consciousness of college students is found 
to be generally low, with large individual differences, for a difficult list 
of words which are potentially within the subjects’ writing vocabulary. 
Spelling consciousness correlates .68 with spelling ability under these 
conditions, and to a low degree, though positively (.30), with general 
intelligence. The need for cultivating spelling consciousness is 
emphasized by these facts, but the development of teaching proce- 
dures toward this objective awaits both the formulation of a 
complete psychology of spelling consciousness and an empirical attack 
by classroom teachers who see their pupils’ need for greater spelling 
consciousness. 





1 See Lull, H. G.: A Plan for Developing a Spelling Consciousness, Elementary 
School Journal, Vol. XVIII, 1917, pp. 355-361; also Tidyman, references cited. 
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AN IMPROVED RATING SCALE TECHNIQUE 
PAUL HANLY FURFEY 


Catholic University, Brookland, D. C. 


It is well known that the reliability of a test increases as its length 
is increased, a fact which is expressed by the Spearman-Brown formula, 

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

in which $r_{nn}$ is the reliability coefficient of a test composed of n parts of 
equal length, the reliability of each part being $r_{11}$. This fact has its 
application to the construction of rating scales when the number 
of judges is increased to increase the reliability of the scale. It ought 
to be possible, however, to obtain the same result in yet another way. 
Not only may the number of judges be increased but the number of 
judgments which each makes may be increased. This is easily 
accomplished by analyzing the trait to be rated into several sub-traits, 
by having the judge rate all these sub-traits separately and then 
combining these separate ratings into a final score. This is quite 
comparable to the process of measuring intelligence by measuring 
separately a number of abilities which are believed to correlate highly 
with intelligence and then combining the separate results into a final 
score. 
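
The Spearman-Brown relation referred to above can be sketched in a few lines of Python. The function name `spearman_brown` and the part reliability of .50 are assumptions chosen for illustration; neither comes from the study.

```python
def spearman_brown(r_part: float, n: float) -> float:
    """Predicted reliability of a composite of n comparable parts,
    each of which has reliability r_part."""
    return n * r_part / (1.0 + (n - 1.0) * r_part)


# Illustrative values only: a single judgment with reliability .50,
# combined into composites of increasing length.
for n in (1, 2, 4, 8, 16):
    print(n, round(spearman_brown(0.50, n), 3))
```

The predicted reliability rises quickly at first and then levels off, which is why adding judgments pays best when each single judgment is only moderately reliable.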

The writer had occasion to test the value of this technique in 
connection with a rating scale for developmental age.¹ As a criterion, 35 
boys who had been studied intensively were selected and divided into 
seven groups representing increasing developmental ages. A large 
number of traits were tried out against this criterion and 18 traits 
were finally selected as showing promising correlation with develop- 
mental age. They included such qualities as changeability of mood, 
reaction to authority and type of games played. These traits were 
reduced to a graphic rating scale of the sort made familiar through 
the work of Freyd, Cady, and others. The rater expresses his opinion 
by placing a cross at some point of a line whose extremes represent 
extremes of the trait. The following will serve as an illustration: 








1 By developmental age, as here used, is meant the factor underlying the 
differential reactions which distinguish older from younger children in their play 
habits, attitude towards authority and general social adjustment. 

C. Does he prefer individual-type games to team games (e.g., tag 
rather than basketball, “I spy” rather than football)? 

Always prefers individual games; never plays team games 
Usually prefers individual games 
Slight preference for individual games 
Equally ready to play either type 
Slight preference for team games 
Usually prefers group games 
Plays group games exclusively 


The scale was then tried on 75 boys in a group with which the 
Department of Sociology of the Catholic University is doing experi- 
mental recreational work. The mean age of the subjects was 169.50 
months and the standard deviation was 15.54 months. A group was 
purposely chosen with a rather large age range for reasons outside the 
scope of this paper. Of course this fact increases the reliability coeffi- 
cients but it does not affect the purpose of this paper which is to com- 
pare the reliability of a team of ratings with the reliability of the 
separate ratings. The judges were two recreational leaders who had 
had exceptional opportunities to know the boys. 

The graphic results were first transmuted into numerical scores 
and then the latter were rendered comparable by reducing them to 
x/sigma measures. A developmental score was then computed for 
each subject by adding together his scores on the separate traits. 
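
A minimal sketch of the scoring step just described, assuming that an "x/sigma measure" is a rating's deviation from the group mean divided by the group standard deviation. The ratings are invented for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def to_sigma_units(scores):
    """Express each score as its deviation from the group mean in sigma units."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def developmental_scores(ratings_by_trait):
    """Sum each subject's standardized ratings over all traits.

    ratings_by_trait: one list per trait, each holding one rating per subject.
    """
    standardized = [to_sigma_units(trait) for trait in ratings_by_trait]
    return [sum(subject) for subject in zip(*standardized)]


# Invented ratings: 3 traits (rows) for 4 boys (columns).
ratings = [
    [2, 4, 6, 8],
    [1, 3, 3, 5],
    [10, 20, 30, 40],
]
print([round(s, 2) for s in developmental_scores(ratings)])
```

Reducing every trait to sigma units before adding keeps a trait with a wide numerical range (the third row here) from dominating the composite.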

There are various ways of estimating the reliability of the scale. 
The usual way is to correlate the ratings of one judge with those of 
the other and take the result as an approximation of the reliability 
coefficient of each judge. In the present case this procedure yields 
a coefficient of .888. If this is taken to represent the reliability of 
each judge, then the reliability of their combined ratings is .940 as 
computed by the Spearman-Brown formula. 

There is another way to estimate the reliability of the scale, namely, 
by estimating the reliability of the 18 separate judgments which each 
judge is called upon to make in the case of each boy. The average 
intercorrelation of these 18 trait ratings may be taken as an approxi- 
mation of the reliability coefficient of each trait-rating. This may be 
calculated expeditiously by Kelley’s formula 171. Having thus calcu- 
lated the average reliability of the 18 trait-judgments taken separately, 
we may apply the Spearman-Brown formula and estimate the reli- 
ability of the entire battery of 18. This method of calculating the 
reliability coefficient is not free from certain fallacies, but it serves as a 
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useful check on the coefficient obtained by the other method and has 
the further advantage of giving a separate coefficient for each judge. 

Applying this method to the ratings of Judge I gives an average 
intercorrelation between his estimates of the 18 traits of .475. Taking 
this to represent the reliability of a single trait-judgment, the reliability 
of Judge I’s composite rating based on the 18 traits is .945 by the 
Spearman-Brown formula. Similar results for Judge II are .437 and 
.936 respectively. Taking .940 for the average reliability of the two 
judges we may estimate the reliability of their combined scores and 
the result is .969. 
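
The arithmetic of the two methods can be retraced with the Spearman-Brown relation. This is a sketch under the assumption that the formula is applied directly with n = 2 judges and n = 18 traits; the variable names are hypothetical.

```python
def spearman_brown(r_part, n):
    """Predicted reliability of a composite of n comparable parts."""
    return n * r_part / (1.0 + (n - 1.0) * r_part)


# First method: inter-judge correlation stepped up to two judges.
two_judges = spearman_brown(0.888, 2)

# Second method: average intercorrelation of traits stepped up to 18 traits.
judge_1 = spearman_brown(0.475, 18)
judge_2 = spearman_brown(0.437, 18)

# Average reliability of the two judges (.940) stepped up to two judges.
combined = spearman_brown(0.940, 2)

for label, value in (
    ("two judges", two_judges),
    ("Judge I, 18 traits", judge_1),
    ("Judge II, 18 traits", judge_2),
    ("combined scores", combined),
):
    print(f"{label}: {value:.3f}")
```

Under these assumptions the first and last figures agree with the .940 and .969 reported above. Direct substitution with n = 18 gives about .942 and .933, slightly below the printed .945 and .936; the difference may lie in rounding of the average intercorrelations or in a detail of the computation that the text does not give.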

Two methods of estimating the reliability of the team of ratings 
give coefficients of .940 and .969 respectively. It now remains to 
determine whether this represents a real improvement over the old 
method according to which each judge gave a single rating on the 
trait to be measured. The ability of the judges to rate single traits 
may be measured by correlating the rating of Judge I with the ratings 
of Judge II on each of the 18 separate traits. This was done with 
the results shown in Table I. 

TABLE I. CORRELATIONS OF RATINGS OF JUDGE I AND JUDGE II ON SEPARATE TRAITS

   A     B     C     D     E     F     G     H     I     J
 .758  .593  .556  .744  .618  .777  .865  .839  .607  .605

   K     L     M     N     O     P     Q     R    Mean
 .695  .711  .798  .894  .817  .853  .672  .714  .695

The correlation on trait N (“Does he enjoy dramatic type of play?”) is 
spuriously high. This was a trait 
which was conspicuously present in one or two boys and absent in all 
the others. With the exception of two or three traits like this the 
coefficients confirm the results of other investigators that judges 
cannot, on the average, agree in their estimates of single traits more 
closely than is expressed by a correlation of .7; and this is true even 
when the judges know the subjects very well and when the range of 
the distribution is considerable, as in the present case. The mean 
correlation in the case of the present two judges is .695. This may be 
interpreted as a reliability coefficient and represents the reliability of 
each judge’s estimate of a single trait. 

It now remains to answer the question asked at the beginning of 
this paper: Is the reliability of a rating scale increased by teaming 
traits into a battery of ratings as the Spearman-Brown formula would 
lead us to expect? The answer is obtained by substituting .695 for 
$r_{11}$ in the formula and 36 for n, since a scale involving 18 judgments by 
each of two judges is 36 times as long as a test involving one rating by 
one judge. The result is .987 and is considerably higher than the 
reliabilities obtained by the scale which were .940 and .969. This 
discrepancy may be explained in various ways. Possibly there is a 
large halo effect in the judge’s ratings, so that instead of making 18 
separate judgments the judge had a tendency to make a general 
estimate of the boy’s maturity and mark all the scores high or low as 
the case might be. However, even if the improvement in reliability 
falls short somewhat of theoretical expectation it is still very consider- 
able. A technique which can improve the reliability of a rating scale 
from less than .70 to at least .94 ought to be useful to investigators. 
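
One way to read this discrepancy, offered as an illustration and not as part of the original analysis, is to solve the Spearman-Brown relation for n and ask how many independent judgments of reliability .695 would yield the reliabilities actually obtained. The function `effective_length` is a hypothetical name for that inversion.

```python
def spearman_brown(r_part, n):
    """Predicted reliability of a composite of n comparable parts."""
    return n * r_part / (1.0 + (n - 1.0) * r_part)


def effective_length(r_part, r_total):
    """Number of comparable parts needed to raise r_part to r_total
    (the Spearman-Brown relation solved for n)."""
    return r_total * (1.0 - r_part) / (r_part * (1.0 - r_total))


# Theoretical expectation for 18 judgments by each of two judges.
print(round(spearman_brown(0.695, 36), 3))

# Equivalent number of independent judgments implied by the obtained values.
for obtained in (0.940, 0.969):
    print(obtained, round(effective_length(0.695, obtained), 1))
```

On this reading the 36 ratings behave like something between 7 and 14 independent judgments, which is consistent with the suggestion that a halo effect ties the separate trait ratings together.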


SUMMARY 


1. A new form of rating scale is described in which the score of 
each judge is assigned on the basis of a number of ratings instead of on 
a single rating. 

2. A new method of calculating the reliability of such a scale is 
proposed. 


3. In the particular case upon which the writer has been working, 
this technique increased the reliability of the raters’ judgments from 
less than .70 to at least .94. 
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“STATISTICAL ISSUES” 
KARL J. HOLZINGER 
University of Chicago 


This tardy note is due to the fact that the writer often fails to read 
copies of the Journal until they are a year or so old, a practice that 
cannot be too severely condemned. In spite of this tardiness, however, 
it seems pertinent to rise to a certain ‘‘Statistical Issue” set forth by 
Dr. Raymond Franzen in the September number of the Journal of 
Educational Psychology for the year 1924. 

Among other issues (all of which we have not examined as yet) 
Dr. Franzen takes exception to a formula by Monroe who proposed at 
one time the quantity $\sigma\sqrt{1-r}/M$ as a measure of the reliability of tests. 


To illustrate the undesirability of this formula, Franzen says on 
page 380: “Suppose that seven children take a test with 10 questions 
and get consecutive scores from 1 to 7 out of the 10 correct. The 
mean is 4 and the standard deviation is 2. Add 5 very easy questions 
to the test, questions which each of the seven children can get correctly, 
so that the scores now range consecutively from 6 to 12. The mean is 
now 9 and the standard deviation is still 2. The standard error of 
measurement $\sigma\sqrt{1-r}$ remains the same before and after the addition 
of the five easy questions. Suppose the correlation between Form 1 
and Form 2 of this test to be .64. The standard error of measurement 
is $2\sqrt{1-.64}$ or 1.2 before and after. The ratio proposed by Monroe 
is $1.2/4 = 0.3$ before the five questions were added to the test, and 
$1.2/9 = 0.133$ after the five questions were added. Still, adding the 
five questions has not changed the test’s reliability since all five new 
questions were answered by all children. Radical differences in 
conception of the reliability of a test judged by the use of a ratio result 
from changes in the location of the zero.” 

Now this is a faulty sort of argument because when a test is 
lengthened its reliability does ordinarily increase, and the problem is merely a 
trick example of no practical significance. Indeed, a similar example 
may be concocted to prove the result the other way round. Let 10 
more problems as hard as the first be added so that the children get 
scores of 2, 4, 6, 8, 10, 12, and 14 instead of 1, 2, 3, 4, 5, 6, and 7. The 
mean is now 8 and the standard deviation is 4. The reliability 
coefficient, .64, remains unaltered when the scores are exactly doubled. 
It now remains to test out the two formulas. The standard error of 
measurement is 1.2 before lengthening the test and $4\sqrt{1-.64} = 2.4$ 
afterward. Monroe’s formula, on the other hand, gives $1.2/4$ and 
$2.4/8$ or 0.3 both before and after the increase in score, and is therefore 
the better method. This argument is as good as Franzen’s, and both 
are nonsense. 
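
Both trick examples can be checked numerically. The sketch below uses the scores and the reliability of .64 given in the text; the function names `sem` and `monroe_ratio` are hypothetical labels for the two quantities being compared.

```python
from math import sqrt
from statistics import mean, pstdev


def sem(scores, r):
    """Standard error of measurement, sigma * sqrt(1 - r)."""
    return pstdev(scores) * sqrt(1.0 - r)


def monroe_ratio(scores, r):
    """Monroe's quantity: the standard error of measurement over the mean."""
    return sem(scores, r) / mean(scores)


r = 0.64
original = [1, 2, 3, 4, 5, 6, 7]
easy_added = [x + 5 for x in original]  # Franzen: five items that everyone passes
doubled = [2 * x for x in original]     # the counter-example: every score doubled

for label, scores in (
    ("original", original),
    ("five easy items added", easy_added),
    ("scores doubled", doubled),
):
    print(label, round(sem(scores, r), 3), round(monroe_ratio(scores, r), 3))
```

Shifting the origin leaves the standard error of measurement fixed and changes the ratio; changing the unit does the reverse. Each example therefore favors whichever measure its construction was designed to favor.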


A more fundamental defect in Franzen’s argument arises from his 
interpretation of the quantity $\sigma/M$. This, except for the coefficient 100, 
is the well known coefficient of variation introduced by Karl Pearson 
in the Philosophical Transactions of the Royal Society, Vol. CLXXXVII, 
1896. We are told by Franzen on page 381 of his article that “when 
zeros are not properly located, we are not justified in using $\sigma/M$ in any of 
its forms.” He then concludes on the next page that, “The only use 
that $\sigma/M$ has in the measurement of human abilities is as a location of 
zero.” Aside from the logic of the above statements, there appears to 
be considerable misapprehension regarding the meaning of a Coefficient 
of Variation. 
Taking this coefficient in the form, 

$$V = \frac{\sigma}{M}$$

it is clear that this quantity is equivalent to $\sqrt{\Sigma (x/M)^2 / N}$ while the 
standard deviation is $\sqrt{\Sigma x^2 / N}$. Thus $\sigma$ is the root mean square variation 
of X while V is the root mean square variation of X/M, both about the 
mean. The former measure of variability takes into account only the 
size of the deviations from the mean regardless of the magnitude of 
the average itself, while the latter gives a measure of dispersion taking 
into account the size of the items relative to the mean. To assert that 
one of these measures is right and the other wrong, is to argue that 
only absolute and not relative comparisons are permissible. 
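
The equivalence stated above is easy to verify numerically. The sketch below reuses the seven scores from the earlier example purely as illustrative data; the function name is hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """V = sigma / M, without the conventional factor of 100."""
    return pstdev(scores) / mean(scores)


scores = [1, 2, 3, 4, 5, 6, 7]
m = mean(scores)
relative = [x / m for x in scores]  # each score expressed as a fraction of the mean

# V computed directly, and as the standard deviation of X/M.
print(coefficient_of_variation(scores), pstdev(relative))
```

The two printed values agree, which is all that the statement "V is the root mean square variation of X/M" asserts.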

On the last page of his note Franzen concludes that “In 
measurements where there is an absolute zero such as linear space, variability 
independent of units may be expressed by this ‘coefficient of 
variability,’ but our data have no such absolute zeros.” Whatever 
Franzen may think he means by the absolute zeros of linear space, his 
argument is still inconclusive because if one is not permitted to find the 
standard deviation of X/M lacking an “absolute zero,” he should by 
the same reasoning not be permitted to use the mean itself. The 
problem appears to be approaching the field of relativity, but to fix 
ideas let us conclude by absolutely disagreeing on these two statistical 
issues. 












THE NEGATIVE SUGGESTION EFFECT OF TRUE-FALSE 
EXAMINATION QUESTIONS¹ 


H. H. REMMERS AND EDNA M. REMMERS 
Purdue University 


One of the objections frequently raised against true-false 
statements is the old pedagogical maxim that no false association should 
ever be presented to the learner. Certainly it would seem on the face 
of it that such a corollary of the laws of learning as “put together that 
which should be learned together, and keep apart that which should 
be learned apart” is violated by the use of this test technique. 
Whether one couches the laws of learning in terms of “laws of 
association” or “complexes” or “conditioned responses” or “Gestalten,” 
it is obvious that the opportunity for forming wrong connections 
exists in approximately 50 per cent of the test items, and many persons 
presumably competent to judge condemn the true-false test on a 
priori grounds. One might retort to the a priori objector that the 
time for forming proper associations is before, and not during, the test. 
The issue is sufficiently serious, however, to warrant an experimental 
attack on the problem. The experiment here described was designed 
to answer for one kind of material and for college sophomores the 
specific question: Do true-false tests tend to leave a residue of false 
associations? 

PROCEDURE 


The following requirements relative to material seemed desirable, 
if not essential, in order to have conditions as nearly like ordinary 
classroom procedure as possible, and yet safeguard the experiment 
from various pitfalls: 

1. A reading selection should be chosen that could be covered by 
the slowest reader in an ordinary class period. 

2. The material should be such that it would be extremely unlikely 
that the subjects had ever read it before. 

3. The selection should be interesting. 

4. It should be possible to formulate a sufficiently large number of 
true-false statements over the material to give a reasonably reliable 
test, i.e., at least 100 statements. 

5. The selection should be such that no student would be likely to 
look up the selection for the purpose of rereading it. 


1 The authors are indebted to Messrs. A. Grant and L. L. Carter for the scoring 
of the tests and the tabulation of the scores. 
A passage from Tacitus on the “Customs of the Germans” was 
finally selected as meeting these specifications. From this selection 
121 each of true-false statements and recall (completion) test items 
were formulated, each true-false statement being paralleled by a 
recall item calling for exactly the same bit of information, idea, or 
inference. The following sample illustrates the type of test items 
selected: 


“Punishment was inflicted upon an unfaithful wife by the whole 
[illegible] ............................ T-F 
Punishment was inflicted upon an unfaithful wife by the .......” 


The reading selection was then mimeographed and passed out to 
two groups of students (hereafter called Groups I and II) equated 
on the basis of the Otis Self Administering Higher Examination. The 
method of equating the two groups was that of arranging the scores 
in rank order and selecting alternate scores for each group. The 
PE for a single score on the Otis test was calculated and found to 
be 4.86. In selecting the two groups as described it was found that 
in no pair of students was the score difference in excess of this quantity. 
In all, 136 usable cases, 68 in each group, were obtained. 
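
The equating procedure can be sketched as follows. The Otis scores are invented for illustration and the function names are hypothetical; only the PE of 4.86 is taken from the text.

```python
def split_alternately(scores):
    """Rank the scores from high to low and deal them alternately into two groups."""
    ranked = sorted(scores, reverse=True)
    return ranked[0::2], ranked[1::2]


def max_pair_gap(group_1, group_2):
    """Largest score difference between students paired by rank position."""
    return max(abs(a - b) for a, b in zip(group_1, group_2))


PE_SINGLE_SCORE = 4.86  # PE of a single Otis score, as reported in the text

otis_scores = [61, 58, 57, 55, 52, 50, 49, 46]  # invented scores
group_1, group_2 = split_alternately(otis_scores)

print(group_1, group_2)
print(max_pair_gap(group_1, group_2) <= PE_SINGLE_SCORE)
```

The final check corresponds to the authors' statement that no pair of students differed by more than the PE of a single score.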

The students were informed in advance of the day of beginning the 
experiment that an investigation concerning the relative difficulty of 
the two types of examination questions was to be made, with a pos- 
sibility that in the future their test scores in psychology might be 
weighted in accordance with the findings of the experiment. Partic- 
ular pains were taken to impress them with the fact that in order to 
have a fair basis for such weighting it would be necessary for each to 
do his best in the experiment. The students were further informed 
that the experimental results would have no direct bearing on their 
class grade. Considerable interest in the experiment was shown on 
the part of the students. 

At the next meeting of the class the mimeographed reading selection 
was passed out with the injunction that everyone study it as carefully 
as possible, in order that he might do well on the test that was to be 
given over the material at the next class meeting. On the day of the 
test Group I was given the true-false items and Group II the recall 
test. So far as the students were aware, this ended the experiment. 
Approximately four weeks later, however, they were again tested 
without warning, and this time the type of test was reversed, i.e., 
Group I took the recall, Group II the true-false test. All the papers 
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were then scored. In order to keep the subjective element involved 
in the scoring of the recall test at a minimum, this test was scored 
by one person for both groups. Various investigators such as Ruch, 
Toops, Brickley, and others have shown that validity and reliability 
of recall tests are ordinarily high when compared with other types of 
objective examinations containing the same number of items. 


THE DATA 


Table I gives the major results of our experiment. 


























TABLE I

                          Arithmetical   σ of           σ of
                          mean           distribution   mean      N
Group I
True-false ............     71.16          19.81         2.40
Recall ................    106.18          10.08         1.21     68
Sum of TF + recall ....    177.34          24.94         3.02
Group II
True-false ............     67.88          19.38         2.35
Recall ................    105.00          15.44         1.75     68
Sum of TF + recall ....    172.88          32.556        3.945





The logic on which we shall base our conclusions is as follows: If 
the average score of Group I be equal to, or greater than, that of Group 
II, it follows that the taking of a true-false test can have had no delete- 
rious effect upon the formation of correct associations. The difference 
of the means of the total score on both tests for the two groups is 
4.46 ± 3.37, i.e., 1.32 times the PE of the difference. The chances 
that the difference is significant are only 1.6 to 1. To the extent that 
this difference is greater than chance our results favor the true-false 
type of examination, for it seems that under the conditions of the 
experiment, which approximated those of ordinary classroom pro- 
cedure, the chances are 1.6 to 1 that taking the true-false test first 
actually taught the subjects something about the material. Insofar as 
true-false statements create in the examinee a critical attitude toward 
all sorts of propositions, whether they be presented by William J. 
Bryan, a classroom instructor, or the highest scientific authority, 
a rather good pedagogical case can be made for this type of test. A 
“true or false?” attitude toward the assertions of mountebanks in 
general is certainly desirable, and in the long run can only aid science 
and the “facing of reality” which it implies. 
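
The significance figures quoted above can be retraced as follows. This is a sketch assuming that the difference is normally distributed, that a probable error equals 0.6745 sigma, and that the odds were taken in the two-sided sense; the text does not state the convention used, and the function name is hypothetical.

```python
from math import erf, sqrt

PE_PER_SIGMA = 0.6745  # one probable error in sigma units (normal distribution)


def odds_beyond_chance(difference, pe_difference):
    """Two-sided odds that a difference of this size is not a chance deviation."""
    z = (difference / pe_difference) * PE_PER_SIGMA
    p_within = erf(z / sqrt(2.0))  # P(|Z| < z) for a standard normal Z
    return p_within / (1.0 - p_within)


ratio = 4.46 / 3.37
print(round(ratio, 2))
print(round(odds_beyond_chance(4.46, 3.37), 2))
```

Under these assumptions the ratio comes to 1.32 and the odds to roughly 1.7 to 1, close to the 1.6 to 1 reported; the small gap may reflect the table from which the original odds were read.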

A retabulation of our data as shown in Table II facilitates the 
interpretation of our results and also makes possible some interesting 
comparisons. 











TABLE II

                         Mean, Group       Differ-    PE of     Approximate
Type of test                               ence of    differ-   chances of
                          I        II      means      ence      significant
                                                                difference
True-false ..........    71.16    67.88    3.28       3.35      1 to 1
Recall ..............   106.18   105.00    1.18       1.44      1 to 1
Sum of TF + recall ..   177.34   172.88    4.46       3.37      1.6 to 1




















It will be noticed that even after an interval of approximately four 
weeks there is no decrease in average score on the recall test when we 
compare the two groups. One might argue from this that the taking 
of a true-false test favors delayed recall. 

Another item of interest is the significantly higher score on a recall 
test as compared with recognition when the latter is corrected for 
“guessing” by scoring rights minus wrongs. This corroborates the 
results obtained by the senior author in an earlier experiment.¹ It is 
of course impossible to correct recall tests for “guessing.” It is also 
obvious that a certain amount of guessing is possible. The example 
cited on page 53 will serve to illustrate. It needs no argument to 
show that the examinee, knowing that he will not be penalized for a 
wrong answer, even though he does not recall the proper phrase, will 
guess at “family,” “husband,” “relatives,” “authorities,” etc. What 
the mathematical chances of a correct response are in any case, it is 
manifestly impossible to say. It is our judgment, however, that this 
factor is insufficient to account for the observed difference. The 
results of three years of use of this sort of test instrument in the 
administration of courses in psychology incline us to the belief that 
when a difficult true-false item is met it is likely to be answered, 
literally or figuratively, on the basis of tossing a coin. Thinking is 





1 Remmers, H. H., et al. An Experimental Study of the Relative Difficulty of 
True-false, Multiple Choice and Completion Types of Examination Questions. 
Journal Educational Psychology, Sept., 1923. 
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unpleasant, and very few enjoy it for its own sake. Since there are 
only two possibilities, a student’s inherent optimism is likely to lead 
him to assume that he will probably guess right. 

With recall items, however, the case is somewhat different. Since 
the case is not one of “either or,” and since there is probably 
considerable internal opposition to leaving any items unanswered, the 
conditions for “thought” are better: there is apparently more of a difficulty 
to overcome, and as a result relatively more effort is made to recall. 
Students when asked which of the two types of tests they prefer always 
vote overwhelmingly for the true-false type on the ground that it is 
“easier.” 


CONCLUSIONS 


1. With the kind of material and subjects used there is no evidence 
of negative carry-over from the taking of true-false examinations. 

2. As between the two types of tests used in this experiment, the 
initial application of true-false tests seems slightly to favor delayed 
recall. 

3. Scores on completion tests are significantly higher than on 
true-false tests when the latter are corrected for guessing. 


4. Carefully controlled experiments with other types of material 
are desirable. 
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A NOTE ON THE RELATIONSHIP BETWEEN THE 
NUMBER OF MONTHS OF STUDY AND 
PROFICIENCY (GEOMETRY) 


DONALD SNEDDEN 
Cooper Union 


A study of current requirements for admission to some American 
colleges would convince the naive that it was assumed that equivalence 
of length of time spent in studying a subject produces equivalence of 
proficiency. That this is not really the case is generally recognized, 
but the extent to which it is not the case is, perhaps, not fully appre- 
ciated by our academic administrators. 

The following data bears on the matter. Four hundred forty-one 
of the applicants to the Cooper Union night courses were, in Septem- 
ber 1925, given Parts I and II of the Hawkes-Wood Placement Test 
in Plane Geometry. All of the 441 boys had studied geometry in High 
Schools for varying lengths of time. Each boy noted on his test paper, 
at our request, the number of months he had studied plane geometry 
in school. The correlation between the number of months plane 
geometry had been studied and the score obtained on the test was 
only .218. Additional data for interpretation follows: 


Mean number of months studied ........................  9.043 
Sigma number of months studied .......................  3.60 
Mean score on geometry test .......................... 34.375 
Sigma score on geometry test ......................... 13.35 


If we assume the reliability of the statements by the boys of the 
number of months they had studied plane geometry to be 1.00 (which 
of course it is not, exactly) and take the reliability of Parts I and II 
of the Plane Geometry test given (.798 by the split halves and Brown 
formula method, for this group) we find that, corrected for attenua- 
tion caused by the unreliability of our measure of geometrical ability, 
the correlation between the number of months studied and the score 
obtained is raised only to .244. So that, even if we had measured the 
geometrical ability with a perfect instrument, the relationship would 
still be, for purposes of prediction, quite negligible. An “r” of .244 
has a “k” of very close to .97. Which is only another way of saying 
that the amount of time a person has put on a subject is worth almost 
nothing, in cases of this sort, as an index of his proficiency in that 
subject. 
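
The two figures in this paragraph can be reproduced with the usual correction for attenuation and, assuming that “k” denotes the coefficient of alienation, with the square root of 1 minus r squared. The function names are hypothetical.

```python
from math import sqrt


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Correlation expected if both measures were perfectly reliable."""
    return r_xy / sqrt(reliability_x * reliability_y)


def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return sqrt(1.0 - r * r)


# Months studied taken as perfectly reliable (1.00); test reliability .798.
r_corrected = correct_for_attenuation(0.218, 1.00, 0.798)
print(round(r_corrected, 3))
print(round(alienation(0.244), 2))
```

A k of about .97 means that errors of prediction are reduced by only about 3 per cent compared with guessing the mean score for every applicant.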

NOTES ON ARTICLES IN EDUCATIONAL 
PSYCHOLOGY IN CURRENT ISSUES OF 
OTHER MAGAZINES 

REPORTED BY C. O. MATHEWS 
Graduate Student, Teachers College 
Columbia University 











INTELLIGENCE TESTING 


Some Factors of Effectiveness in Mental (“Intelligence”) Tests. S. A. Hamid. 
The British Journal of Psychology, Oct., 1925, 100-115. This is a study of the 
effect of numerous factors upon success by an analysis of many thousand test 
results. 

The Worcester Formboard Series. David Shakow and Grace H. Kent. The 
Pedagogical Seminary and Journal of Genetic Psychology, Dec., 1925, 599-611. 
A description is given of a series of small, portable formboards. Tentative norms 
are included. 

Predicting Academic Achievement; A Study in Probability. Curt Rosenow. 
The Pedagogical Seminary and Journal of Genetic Psychology, Dec., 1925, 628- 
636. This is an attempt to show how accurately academic achievement can be 
predicted by the use of psychological tests. 

A Study of Differences Found between Races in Intellect and in Morality. Kath- 
arine Murdoch. School and Society, Nov. 14, 1925, 628-632. This is the first 
section of a report of a comparative study of the races represented in Hawaii. 
Differences in intellect are here pointed out. 

A Study of Differences Found between Races in Intellect and in Morality. Kath- 
arine Murdoch. School and Society, Nov. 21, 1925, 659-664. Morality and 
musical differences between the races of Hawaii are here reported. 


ACHIEVEMENT TESTING 


Improving Instruction through Point Tests. Joseph C. McElhannon. Peabody 
Journal of Education, Nov., 1925, 131-138. The results of several college classes 
on point tests are given together with a discussion of the values of this method of 
measuring achievement. 

A Scale for Scoring Tests with Alternative Answers. Gilbert J. Rich. Ameri- 
can Journal of Psychology, Oct., 1925, 597-600. This is a method of scoring 
alternative answer tests intended to correct for the chance factor by weighting 
the successive numbers of the test in inverse proportion to the probability of their 
being reached by guessing. 

Efficiency in Learning and the Accomplishment Ratio. Florence L. Goodenough. 
Journal of Educational Research, Nov., 1925, 297-300. This paper concludes that 
it is not valid to compare the learning efficiency of children who differ widely in 
intelligence by the use of the accomplishment ratio.


LEARNING AND THE PSYCHOLOGY OF SCHOOL SUBJECTS


What Words Should Children Be Taught to Spell? II. Vocabularies of Various 
Types. Frederick S. Breed. The Elementary School Journal, Nov., 1925, 202-
214. It is pointed out that the best spelling vocabulary at present is that based 
upon both the written work of children and adult correspondence. 

Functions of Flash Card Exercises in Reading; An Experimental Study. Arthur 
I. Gates. Teachers College Record, Dec., 1925, 311-326. The effect of the use 
of flash cards upon silent reading ability is shown. 

Flash Cards as a Method of Improving Silent Reading in the Third Grade. Robert 
E. Scott. The Journal of Educational Method, Nov., 1925, 102-112. Both rate 
and comprehension in silent reading show much more rapid increase in the classes
using flash cards than in the control classes. 

An Experimental Study of the Merits of Extensive and Intensive Reading in the 
Social Sciences. Carter V. Good. The School Review, Dec., 1925, 755-770.
“The results of this investigation show that the character and length of reading 
assignments should vary with the purpose of reading.” 

Adding Up or Down; A Discussion. B. R. Buckingham. Journal of Educa- 
tional Research, Nov., 1925, 251-261. The opinions and preferences of 493 
students, most of them with teaching experience, are presented with suggestions 
for an experimental attack of the problem. 

Standard Test Lessons in Reading. William A. McCall and Lelah Mae Crabbs. 
Teachers College Record, Nov., 1925, 183-191. These test lessons are described 
and the procedure for their use is outlined. 


CHARACTER AND PERSONALITY 


Character and Temperament Tests. Mary Collins. The British Journal of 
Psychology, Oct., 1925, 88-99. The author considers the respective values of the 
Kohs, Brotemarkle, Pressey and Downey Tests. 

A Method of Measuring the Emotional Maturity of Children. Othniel R. 
Chambers. The Pedagogical Seminary and Journal of Genetic Psychology, Dec., 
1925, 637-647. A differential unit for use with boys is developed with the Pressey 
X-O tests. Norms are given and their diagnostic values discussed. 

A Rating Scale for Practice Teachers. J. S. Kinder. Education, Oct., 1925,
108-114. The author presents a copy of the scale and discusses its reliability and 
practical value. 

MISCELLANEOUS 


Agnes; A Dominant Personality in the Making. Helen T. Woolley. The 
Pedagogical Seminary and Journal of Genetic Psychology, Dec., 1925, 569-598.
This is the first of a series of case studies from the records of the Merrill-Palmer 
School. 

Physical and Mental Measurements of Fraternal Twins. L. A. Averill and A. D. 
Mueller. The Pedagogical Seminary and Journal of Genetic Psychology, Dec., 
1925, 612-627. The results of 10 pairs of twins on a number of intelligence and 
achievement tests and their physical measurements are here reported. 

The Selection of Bright Children for Special Classes. A. Scott Lee. The Ele- 
mentary School Journal, Nov., 1925, 190-198. The results are given of a 
questionnaire study of cities in the United States with a population of 100,000 
or more.

Teacher’s Problems and Courses in Educational Psychology. Dean A. Worcester.
Educational Administration and Supervision, Nov., 1925, 550-551. A study of 
101 students in Educational Psychology reveals their general opinion to be that 
Educational Psychology is too theoretical and that they value more, specific topics 
of relatively immediate bearing upon every day school problems. 

The Correlation between College Lecture Notes and Quiz Papers. C.C. Crawford. 
Journal of Educational Research, Nov., 1925, 282-291. The author concludes 
that it seems highly probable that the note taking practices of students and the 
studying of them before a quiz results in better quiz grades. 

Bibliography on Psychological Tests and Other Objective Measures in Industrial 
Personnel. Grace E. Manson. The Journal of Personnel Research, Nov.—Dec., 
1925, 331-338. Many references in this bibliography are related to educational 
psychology. 

Pupil Reaction to School Reports. I. W. A. Barton, Jr. The School Review,
Dec., 1925, 771-780. Partial results are given of a questionnaire study of 1513 
high school pupils, designed to discover pupils’ opinions and the effects of school 
reports.


A HISTORY OF INTELLIGENCE TESTING


Early Conceptions and Tests of Intelligence, by Joseph Peterson. 
Yonkers: World Book Co., 1925. Pp. XIV + 320. 


Intelligence testing can no longer be regarded as a passing fad, as a 
“cause” championed by a few misguided enthusiasts, for it now has
a substantial history. The appearance of such a history is opportune 
and the history itself will lend dignity to a field of psychology, which 
is even yet regarded by some psychologists as very unscientific and 
undignified. It will also help to give the workers in this field a better 
background for their work. 

The writing of a history that is not a mere chronological list of 
facts but which attempts an interpretation of such facts is always 
difficult. The author has, however, avoided controversial points as 
far as possible and has presented a very fair and straight-forward 
account of how intelligence testing arose. Like all good histories, 
this history likewise goes back to Aristotle. Four chapters are devoted 
to the early philosophical conceptions of intelligence. Then we are 
given an excellent account of the sporadic tests which appeared in the 
latter half of the nineteenth century. These scattered attempts at 
testing would seem to have had within them the germ of the present 
intelligence tests. The appearance of Binet at this time, however, 
turned the tide toward the type of test suggested by him. The 
remainder of the book, about two thirds, is devoted to Binet’s work 
and his influence upon the early work in this country. No attempt is 
made to go beyond Binet’s death in 1911. That two-thirds of the 
book should be devoted to Binet is wise, for the modern tester is apt 
to forget the great debt that we owe to him. There is no more sug- 
gestive writer than Binet, and there is none who is less known among 
American students of psychology. It is well that we have this well- 
written and detailed account by Peterson. Before attempting anything
new in psychological testing, it is good to find out first what
Binet did or thought about it. Not only is he among the leaders of
intelligence testing proper, but he worked on and published with a 
collaborator several educational scales as early as 1905. He further 
had the idea of educational age at that time. He started rough moral- 
ity or character tests. He made attempts to give intelligence tests 
to soldiers, and if he could have gone on in this direction he would 
undoubtedly have constructed group tests, as he himself suggests. 
The reliability and validity of tests, the correlation of tests with 
teachers’ judgments are all found in Binet’s work.

The reviewer feels that the author has written an excellent account 
of Binet’s work. He doubts whether some of the early chapters have 
much bearing on the problems of intelligence testing. The book, as a
whole, is invaluable. It will form a welcome addition to the growing
library of sound books on the measurement of intelligence. 

R. PINTNER. 
Teachers College, 
Columbia University. 





AN HISTORIAN ON SOCIAL PSYCHOLOGY


Psychology and History, by Harry Elmer Barnes. New York: The 
Century Company, 1925. Pp. 193. 


Harry Elmer Barnes is a young historian who owes his rise on the 
historical market chiefly to his scientific attitude toward the origins of 
the recent war. “Psychology and History,” the book under consideration,
is the reprint of a chapter from his New History and the Social
Studies. It is an interesting and valuable piece of work in that it 
gives a succinct statement of the more important developments in 
the field of social psychology. The brief synopses of the contributions 
of James, Hall, Thorndike, McDougall, Trotter, Watson, and other 
psychologists to the theory of sociology are excellent. The differences 
between the more orthodox group which bases its social psychology 
on a few general instincts and the behaviorist group which attacks 
the older conception of instinct are well brought out. The statement 
of the case for each man and each group is, on the whole, quite impar- 
tial except for an obvious bias toward the psycho-analysts. 

One feels, however, that Mr. Barnes fails in being sufficiently 
critical. He does not, for example, seem to be entirely conscious of 
the extremely hypothetical nature of much of the work in social
psychology. He does not evaluate it with sufficient harshness in the
light of the scientific method. He accepts it too heartily at its face 
value. This may seem peculiar in an historian who might be expected 
to bring to bear on the work of the social psychologists the critical 
attitude developed in his own field. The phenomenon, however, is 
not uncommon. Certainly the psychologists in penetrating the social 
sciences have been less rigid in their methods than they are in their 
particular field of individual psychology. 

Perhaps one of the most important contributions Mr. Barnes could 
have made by a more critical method would have been in an analysis 
of the determining factors in history. There are two important points 
of view on this question. One is that psychological factors are the 
chief determinants. The other is that factors outside man’s psychol- 
ogy, such as geographic and economic factors, are the chief deter- 
minants. The author does, indeed, touch upon this problem at a 
number of points, but he never treats it systematically. He is him- 
self largely committed to the principle of psychological determinism. 
He says: 

“The essence of the psychological interpretation of history is the 
thesis that the determining factor in historical development is the 
collective psychology of an era and of a given cultural group. Its 
adherents rightfully claim that it is not only the most scientific 
but also the most all-inclusive of the various types of historical 
interpretation.”

A little later, however, in discussing Puritanism, he says: “Some 
might urge—that the American has not been industrious because he 
is a Puritan neurotic, but has become the typical Puritan because 
industry and self denial were essential to the success of an ambitious 
but impecunious population in a new and undeveloped country. 
Doubtless there is much to be said for both viewpoints.” 

In general, however, he fails to evaluate the two viewpoints in 
relation to each other. He does not, on the whole, consider the theory 
of psychological determinism, which he espouses, in the light of well- 
known and well-substantiated historical facts. He gives an emphasis 
to the psychological determinants of history which he may be able to 
confirm with facts, but which he has not yet confirmed. For example,
he gives great weight to the peculiarities of Hamilton and Jefferson as 
causal factors in the political course of events in this country in the 
late eighteenth and early nineteenth centuries. He goes so far as to
say:

“To a very large degree our strong federal government has been
but a collective appropriation of the authority-loving and reality- 
conquering personality of Alexander Hamilton.” 

So? And had the economic interests of the founding fathers, as 
they are shown by the careful study of Beard, nothing to do with this 
same strong federal government? Would it, then, not have existed if 
Hamilton had not been an extrovert? We need not take it for granted 
that Mr. Barnes cannot prove his point, but certainly he has not done so. 

Toward the end of the discussion he makes the statement that 
“even before the psychology of the unconscious was generally under- 
stood many students of economic history had ceased to claim a primacy 
for the economic factor as compared with all other elements which 
made for progress and development, and had held that the psychic 
factor was obviously the dominant element, though economic processes
and institutions might exert a most important influence upon the
psychological factors. In an able and suggestive paper Professor
William F. Ogburn has carried this line of analysis still further and
has pointed out the vital significance of unconscious psychological 
processes in the field of economic motives and activity.” 

In further amplification of this idea, Mr. Barnes points to examples: 

“How far, for example, was the austere impurity complex of the 
‘glacial age’ of New England Puritanism a psychic compensation for 
economic chicanery in smuggling and the rum trade? How far were 
the philosophical discussions and oratorical tirades concerning liberty, 
natural rights and revolution in the period following 1765 a compensa- 
tion for and justification of the prevailing system of smuggling? It 
cannot be without significance that the leading haranguer for liberty 
in Boston was fed and clothed by the leading smuggler, nor that the 
most conspicuous name on the Declaration of Independence was that 
of the most notorious violator of the customs regulations. Again, it 
would be interesting to know why the public statements of the leading 
colonial radicals indicate that their fundamental loyalty to Great 
Britain grew progressively more intense until about July 1, 1776. It 
has long been suspected and recently been proved that the legalistic 
arguments over nationalism and states-rights during the first decade of 
our national history were but the rhetorical drapery which covered the 
economic interests from which Hamilton and Jefferson drew their 
supporters.”

These are very good examples. But of what? Are they examples 
of the fact that the psychology of the colonists and early citizens was
the determinant of events in the colonies and in the new country?
Or are they examples of the fact that these interesting psychological
states were the accompaniments, the “drapery” of acts motivated by
economic interests? 

One cannot but hope that Mr. Barnes will follow up his interesting 
and suggestive discussion with a more careful analysis of the problems 
and conflicts arising out of the union of psychology and the social 
sciences. 

ELIZABETH GALLOWAY WOODS.


JOURNALIZED PSYCHOLOGY FOR THE LAYMAN


Increasing Personal Efficiency, by Donald A. Laird. New York: 
Harper and Brothers, 1925. Pp. X + 90, $3.00. 


James Harvey Robinson’s Humanizing of Knowledge very persua- 
sively defends the thesis that the interpretation of scientific knowledge 
to the laity in popular language is one of the primary charges upon the 
capital of the time and effort of the scientific fraternity. The book 
here under review is the outcome of an attempt to popularize psycho- 
logical findings in so far as they presumably apply to increasing per- 
sonal (and to some extent social) efficiency. 

The content of the book can be summarized under such captions 
as control of physical environment, economy in acquiring skills and 
knowledge, the improvement of silent reading, fatigue, errors in 
thinking, control of emotions, and the rather large and relatively 
unexplored area of mental hygiene in general. Be it said that the 
chapter headings are better designed to catch the attention of the 
prospective improver of personal efficiency than these captions. 

Dr. Laird has an extremely happy knack of putting into popular 
“journalese” the frequently abstruse findings of the psychologist and
psychiatrist. Indeed, the specialist in psychology will frequently be 
irritated by the over-simplification and ex-cathedra statement of 
various moot questions. But it is to be remembered that the book 
is written not for the specialist but for the person of very little special 
training of any sort, in short, for the man in the street. The specialist
would no doubt find, for example, that the explanation of amnesia
(of course it is not called that) in six lines leaves too much more
to be said. Nor is it difficult to predict that a large number of
psychologists will bridle at the personification of the “conscious” and
the “unconscious,” particularly if they be of a monistic and
behavioristic bent. The extreme skeptic on the transfer of training 
might easily speculate somewhat cynically on the net measurable 
effect of the study of such a book as this, while the extreme optimist on 
this problem might well expect to quail before the individual who 
really set about improving his efficiency as grimly as he is presumably 
expected to do. Lastly, the philosopher who takes the cover title and 
its accompanying dialogue seriously will probably be disappointed in 
not finding, after all, the destination of man, but only some oil for
the bearings of his flivver. The dialogue accompanying one of
Hendrik Van Loon’s inimitable sketches is worth quoting:


Martian Kid: “What are those people on earth doing?”
Martian Pa: “Going.”
Martian Kid: “Going where?”
Martian Pa: “Nowhere. Just going.”


Even so, I should venture the assertion that many an instructor in 
Psychology I will find it profitable to sketch through this little volume 
for illustrative material with which to entice the jaded mental appe- 
tites of sophomores. There is also little doubt that we have here 
much popular summarization and application of matter psychological, 
together with a great many persuasive figures of speech (ever the aid 
of the expounder of new knowledge) and considerable shrewd observa- 
tion on human nature. 

Attractive novelties are several sketches by Hendrik Van Loon, a 
number of rather ingenious charts and diagrams, and particularly the 
self-testing devices scattered throughout the book. Of these the 
Personal Inventory Test, alias the Colgate Mental Hygiene Test, 
deserves special mention. This is a revision of the Woodworth 
Psychoneurotic Inventory adapted to a graphic scale with an ingenious 
stencil for rapid scoring. This instrument gives promise of being an
aid to “spotting” incipient psychotic cases. (See the Journal of 
Educational Psychology, September, 1925, for a description, and the 
Journal of Applied Psychology, September, 1925, for a report on 
reliability.) 

The book itself, in form and content, is a good object lesson in 
applied psychology. 

H. H. REMMERS.
Purdue University.


MEASURING ATTITUDES AND PREJUDICES


The Measurement of Fair-mindedness, by Goodwin B. Watson. New 
York City: Teachers College, Columbia University, 1925. Pp. 97. 


Experimenters in the field of character, whether interested in the 
construction of character tests, or in the study of causal factors deter- 
mining character, or in various methods and curricula to influence 
development along social-moral lines, will find in Dr. Watson’s work 
stimulation and suggestion for further research. By a series of carefully
constructed tests the author has demonstrated his ability to
measure objectively the presence and degree of prejudice on certain
present day religious and economic issues. Twelve typical lines of
bias were studied, including the points of view which would be in 
agreement with economic radicals, economic liberals, economic capi- 
talists, modernists and fundamentalists in religion, Protestants and 
Catholics, persons with strict or loose standards of sex-ethics or 
amusements or similar moral matters, persons fighting for a “social
gospel” or those mainly interested in a “personal gospel” and religious
radicals. 

Reliability of the tests was established by correlating scores on 
alternate items against each other, by correlating three of the forms 
against the remaining three forms, and by correlating each form with 
the total. Six lines of study were carried on to determine validity: 
(1) examination of the tests with reference to what they seem to be 
measuring, (2) correlations between each form of the test and the 
whole, (3) a study of the scores made by individuals known to be most 
fair-minded, (4) a similar study on the reactions of individuals known 
to have established prejudices, (5) similar studies on groups, (6) 
examination of the extent to which the test may measure intelligence 
or opinion rather than prejudice. 
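
The alternate-item reliability check described above can be sketched in a few lines. This is a minimal illustration, not Dr. Watson's actual computation: the function names and the toy item scores are hypothetical, and the Spearman-Brown step-up from half-test to full-test reliability is an added assumption that the review does not mention.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(item_scores):
    """Correlate each person's total on alternate items.

    item_scores: one row per person, one column per test item
    (hypothetical layout chosen for this sketch).
    Returns (half-test correlation, Spearman-Brown full-test estimate).
    """
    first_set = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_set = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(first_set, second_set)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown prophecy for doubled length
    return r_half, r_full


# Toy data, invented for illustration only: four persons, six items.
toy_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(toy_scores)
```

The same `pearson_r` helper would serve for the other two checks named in the paragraph (three forms against the remaining three, and each form against the total) by passing the corresponding pairs of score lists.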

Two methods for scoring the tests were developed, one which 
serves to indicate the general level of prejudice within the individual or 
group, the other, an “analytical score” which indicates along which
particular religious or economic lines the individual or group is biased. 
Degree, as well as direction, is measured. The analytical score is best 
represented by the “prejudice profile.” Seventeen such profiles for 
individuals and 10 for groups are included in the appendix and show 
graphically certain typical reactions from broad and narrow-minded 
people. They also indicate bias along special lines and reveal differences
between contrasted groups of normal school and theological
students, old and young men in the ministry, ministers with and without
college and seminary training, Catholics and Protestants, attitudes
of students before and after a conference, and other paired groups. 
Appreciation of the significance of these tests will come with their 
extended use. The actual material measured will be out of date in a
few years’ time, but the technique developed would open a wide field of
research into other and future issues, where motives, feelings and pur- 
poses are the chief determinants of private and public opinion. 
GLADYS C. SCHWESINGER.





A PHILOSOPHICAL INTROSPECTIONIST SPEAKS 


The Crisis in Psychology, by Hans Driesch. Princeton: Princeton 
University Press, 1925. Pp. XII + 271. 


The author is the distinguished leader of the vitalistic school of 
biologists. His psychological concepts arise from his logic and are in 
accord with his general philosophical system that all science is nothing 
more than a theory of order (Ordnungslehre); hence his procedure is 
analytical and dialectical. His experimental data are meager, for he is 
dealing with postulates prior to experiment itself. For him the 
irreducible and primordial fact is: “I have something consciously.”
Introspection is his method of study. Of behaviorism he says, “Not
to use introspection in ‘my’ psychology (i.e., the psychology distinct
from that of the other Ego, such as animal, etc.), would be to proceed 
as if I always made use of a mirror in order to see what I might see 
directly—or even worse!’’ In his theory of materials he accepts six 
groups. In naming them he seems, in one case, to acknowledge the 
validity of imageless thought as sponsored by the school of Külpe and
the Würzburgers, but on page 32 he appears to favor the notion of the
vicarious function of images (Hollingworth) stating that there exists in 
every case a sensible bearer of a thought. Apparently the incon- 
sistency is unnoticed. 

In lieu of associative affinity which he denies, he holds the concept 
of limiting and directing agents. Similarly, while admitting a “functional”
dependency between consciousness and certain organic activities,
he denies a “causal” relation, and settles the body-mind problem
by ruling psychomechanical parallelism out on four counts and 
accepting the belief that nature and mind are two spheres of empirical 
existence which are absolutely separated from one another and therefore
are unable to act upon each other in a causal way. He suggests as
necessary the uses of the phenomena of hypnosis, dissociation 
consciousness, freudism, and the like for an explanation of the organi- 
zation of the mind and for confirmation of his doctrines of the soul, 
freedom, and immortality. He voices the demand that these be 
incorporated in our program of research if psychology is to be a 
complete science. 

This book expresses in simpler form his views presented more 
brilliantly in his Philosophy of the Organism. It is significant, not for 
its content, but for the individualistic viewpoint it represents, and his 
offer of his vitalistic philosophy as a basis for procedure. His final 
chapters are either too far ahead or behind our thinking to win the 
allegiance of many. His vocabulary is his own and readers, at least 
those not grounded in the older philosophical diction, will throw back 
at the author his own accusation that “Language . . . is rather 
more of a handicap than a help.”

G. W. H. & E. M. B.





A PROBLEM ATTACK UPON EDUCATIONAL DIFFICULTIES


Problems of the Teaching Profession, by John C. Almack and Albert R. 
Lang. Boston: Houghton Mifflin Company, 1925. Pp. XII + 
340. 


Among the latest books are to be found a few which attempt to
present one or more phases of education as a series of problems. The 
above volume contains a large number of professional and social 
problems pertinent to the teaching profession. The factors involved in 
the solution of each problem are given and in some instances definite 
attempts have been made to set up new techniques of analysis and 
organization. 

It is evident that the purpose of the book has not been that of 
reaching final conclusions. Dogmatic or arbitrary statements are 
few. No doubt the few such statements have been made with experience
and good judgment as a basis, even though the last word in
evidence has not been collected. 

The book is divided into 17 chapters. Each chapter deals with a 
phase of the teaching profession. At the end of each chapter there 
is a list of examples and problems and a list of selected references 
related to the chapter content. In all there are 221 examples and 
problems, and 223 selected references.

It is the opinion of the reviewer that a book of this sort has a place
in the educational program of to-day. The significant forward look
which it seems to stimulate originates in a feeling of a need for a
thoroughgoing job analysis of each aspect of the teaching profession. 

The problem method of attack, with accompanying improvement 
in the solution of each problem as time goes on and evidence is col- 
lected seems to be a desirable and efficient procedure of ultimately 
arriving at the truth. 

It is suggested that administrators and teachers take a little 
“time out” to read this interesting book.

CHARLES C. WEIDEMANN. 





MEASUREMENTS AND NORMS IN RELIGIOUS EDUCATION


The Indiana Survey of Religious Education. Vol. II. Measurements 
and Standards in Religious Education, by Walter S. Athearn. 
New York: Doran, 1924. 


This work brings together and conserves the results of the efforts 
made under the Interchurch World Movement to establish norms in 
religious education. 

The main section deals with the scoring and rating of the buildings, 
equipment and textbooks used for religious education, by utilizing the 
techniques common in administration of secular education. These
findings are valid because of the objectivity of the data studied, and 
the section represents, therefore, an essential contribution to the 
scientific trend in education of all kinds. 

The value of Part Four is not so clear. It offers a series of tests 
of teaching in Sunday Schools which have been experimentally 
devised, and which seek to reveal the person’s Bible knowledge, 
religious beliefs and ideas and conduct under typical situations. 
The tests revealed, for example, that Bible knowledge does not func- 
tion in the field of ethical judgment. Then what is religious 
education? 

Finally there is offered a composite index for the rating and com- 
parison of local church schools, in an attempt to present in a single 
view the educational status of the local church. Such effort is grati- 
fying indeed to those who look toward a more scientific control of 
human behavior through the religious agencies. The authors acknow- 
ledge the difficulties involved and the tentative and pioneer aspects of 
their work.

But after all what is the real value of “consensus of opinion” upon
which the scales are constructed? Is it likely to be any more reliable 
than that of the inquisitors of Spain? But such a criticism can be 
applied to any use of the jury method in the determination of rating- 
scales. And yet common agreement is the basis for the foot and the 
quart and the pound. 

The volume is really an indispensable handbook for evaluating
particularly the equipment of Sunday schools but is much over the 
heads of the majority of Sunday school teachers and Superintendents. 
That should challenge the leadership in religious education. 


DAN H. KULP, II.
Teachers College,
Columbia University.